Abstract

We study optimal solutions to an abstract optimization problem for measures, which
is a generalization of classical variational problems in information theory and statistical
physics. In the classical problems, information and relative entropy are defined using
the Kullback-Leibler divergence, and for this reason optimal measures belong to a
one-parameter exponential family. Measures within such a family have the property of mutual
absolute continuity. Here we show that this property characterizes other families of
optimal positive measures if a functional representing information has a strictly convex dual.
Mutual absolute continuity of optimal probability measures allows us to strictly separate
deterministic and non-deterministic Markov transition kernels, which play an important
role in theories of decisions, estimation, control, communication and computation. We
show that deterministic transitions are strictly sub-optimal, unless information resource
with a strictly convex dual is unconstrained. For illustration, we construct an example
where, unlike non-deterministic, any deterministic kernel either has negatively infinite
expected utility (unbounded expected error) or communicates infinite information.

1 Introduction

This work was motivated by the fact that probability measures within an exponential family,
which are solutions to variational problems of information theory and statistical physics, are
mutually absolutely continuous. Thus, we begin by clarifying and discussing this property in
the simplest setting. Let $\Omega$ be a finite set, and let $x : \Omega \to \mathbb{R}$ be a real function.
Consider the family $\{y_\beta\}_x$ of real functions $y_\beta : \Omega \to \mathbb{R}$, indexed by $\beta > 0$:

$$y_\beta(\omega) = e^{\beta x(\omega)}\, y_0(\omega), \qquad y_0(\omega) > 0 \tag{1}$$

The elements of $\{y_\beta\}_x$ represent one-parameter exponential measures
$y_\beta(E) = \sum_{\omega \in E} y_\beta(\omega)$ on $\Omega$, and normalized elements
$P_\beta(\omega) = y_\beta(\omega)/y_\beta(\Omega)$ are the corresponding exponential
probability measures. Of course, exponential measures can be defined on an infinite set, for
example, as elements of the Banach space $Y := \mathcal{M}(\Omega, \mathbb{R}, \|\cdot\|_1)$ of real Radon
measures on a locally compact space $\Omega$ [11]. In this case, $x$ and $e^x$ are elements of the
normed algebra $X := C_c(\Omega, \mathbb{R}, \|\cdot\|_\infty)$ of continuous functions with compact support in
$\Omega$. As will be clarified later, $Y$ can be considered not only as the dual of $X$, but also as a
module over algebra $X$, which explains the definition of an exponential family (1) as
multiplication of $y_0 \in Y$ by elements of $X$. Furthermore, for some $y_0$, exponential measures
are finite even if function $x$ is not continuous, has non-compact support and is unbounded.
A similar construction can be made in the case when $X$ is a non-commutative $*$-algebra, such
as the algebra of compact Hermitian operators on a separable Hilbert space used in quantum
probability theory. However, quantum exponential measures can be defined in different ways,
such as $y_\beta := \exp(\beta x + \ln y_0)$ or $y_\beta := y_0^{1/2} \exp(\beta x)\, y_0^{1/2}$, which are not
equivalent.

One property that characterizes all these exponential measures is that elements within a
family are mutually absolutely continuous. Recall that measure $y$ is absolutely continuous
with respect to measure $z$ if $z(E) = 0$ implies $y(E) = 0$ for all $E$ in the $\sigma$-ring of
subsets of $\Omega$. Mutual absolute continuity is the case when the implication holds in both
directions. It is easy to see from equation (1) that exponential measures within one family
have exactly the same support and are mutually absolutely continuous. This property is
particularly important when measures are considered on a composite system, such as a direct
product of two sets $\Omega = A \times B$. Normalized measures on such $\Omega$ are joint probability
measures $P(A \times B)$ uniquely defining conditional probabilities $P(A \mid B)$ (i.e. Markov
transition kernels). Observe now that if $P(A \times B)$ and $P(A)P(B)$ (product of marginals) are
mutually absolutely continuous, then $P(a \mid b) > 0$ for all $a \in A$ such that $P(a) > 0$.
Conditional probability with this property is non-deterministic, because several elements
$a \in A$ can be in the 'image' of $b \in B$. Clearly, all joint probability measures within an
exponential family define such non-deterministic transition kernels.
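The common-support property of family (1) can be checked directly on a small finite set. The following is a minimal sketch, not taken from the paper: the three-point set, the utility values `x` and the reference measure `y0` are hypothetical, chosen only for illustration.

```python
import math

def exponential_family(x, y0, beta):
    """Return y_beta(w) = exp(beta * x(w)) * y0(w) for every point w, as in equation (1)."""
    return {w: math.exp(beta * x[w]) * y0[w] for w in y0}

def normalize(y):
    """Turn a finite positive measure into a probability measure P = y / y(Omega)."""
    total = sum(y.values())
    return {w: v / total for w, v in y.items()}

def support(y):
    """Set of points that carry non-zero mass."""
    return {w for w, v in y.items() if v > 0}

# Hypothetical utility x and strictly positive reference measure y0 on a three-point set.
x = {"w1": -1.0, "w2": 0.0, "w3": 2.0}
y0 = {"w1": 1.0, "w2": 1.0, "w3": 1.0}

p_small = normalize(exponential_family(x, y0, 0.5))
p_large = normalize(exponential_family(x, y0, 5.0))

# Both members of the family charge exactly the points charged by y0,
# so each is absolutely continuous with respect to the other.
same_support = support(p_small) == support(p_large) == support(y0)
```

Increasing `beta` concentrates mass on the point with the largest utility, but no point of the support ever receives zero mass for finite `beta`.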

Another, perhaps the most important, property of exponential families is that they are, in
a certain sense, optimal. It is well-known in mathematical statistics that the lower bound for
the variance of the unbiased estimator of an unknown parameter, defined by the Rao-Cramer
inequality, is attained if and only if the probability distribution is a member of an exponential
family [13, 31]. In statistical physics, it is known that exponential distributions (i.e. Boltzmann
or Gibbs distributions) maximize entropy of a thermodynamical system under a constraint on
energy [?]. In information theory, exponential transition kernels are known to maximize a
channel capacity [33, 34, 35], and they are used in some randomized optimization techniques
(e.g. [20]) as well as various machine learning algorithms [39]. A one-parameter exponential
family has been studied in information geometry, and it was shown to be a Banach space
with an Orlicz norm [30]. Similar constructions have been considered in quantum probability
[?].

Optimality of exponential families of measures on one hand and their mutual absolute 
continuity on the other is a particularly interesting combination, because it seems that for the 
first time we have an optimality criterion, with respect to which all deterministic transitions 
between elements of a composite system are strictly sub-optimal. This appears to have impor- 
tance not only for information and communication theories, but also for theories of computa- 
tional and algorithmic complexity, because Markov transition kernels can be used to represent 
various input-output systems, including computational systems and algorithms. Thus, under- 
standing the relation between mutual absolute continuity within some families of measures 
and their optimality was the main motivation for this work. 

It is well-known, and will be recalled later in this paper, that a one-parameter exponential
family of probability measures is the solution to a variational problem of minimizing
Kullback-Leibler (KL) divergence [23] of one probability measure from another subject to a
constraint on the expected value. In fact, the logarithmic function, which appears in the
definition of the KL-divergence, is precisely the reason why the exponential function appears in
the solutions. However, mutual absolute continuity, which for composite systems implies the
non-deterministic property of conditional probabilities, is not exclusive to families of
exponential measures. Indeed, geometrically, this property simply means that measures are in the
interior of the same positive cone, defined by their common support. Thus, our method is based
on a generalization of the above mentioned variational problem by relaxing the definition of
information and then employing geometric analysis of its solutions.

In the next section, we introduce the notation, define the generalized optimization problem
and recall some basic relevant facts. An abstract information resource will be represented by
a closed functional $F : Y \to \mathbb{R} \cup \{\infty\}$, defined on the space $Y$ of measures, and such
that its values $F(y)$ can be associated with values $I(y, y_0)$ of some information distance (e.g.
the KL-divergence). In Section 3 we establish several properties of optimal solutions. In
particular, we prove in Proposition 3 that the optimal value function is an order isomorphism
putting information in duality with expected utility of an optimal system. These results are then
used in Section 4 to prove a theorem relating mutual absolute continuity of optimal positive
measures to strict convexity of functional $F^*$, the Legendre-Fenchel dual of $F$ representing
information resource. We show that strict convexity of $F^*$ is necessary to separate different
variational problems by optimal measures, and for this reason it appears to be a natural minimal
requirement on information, generalizing the additivity axiom. Because the proof of mutual
absolute continuity does not depend on commutativity of algebra $X$, pre-dual of $Y$, these
results apply to a general, non-commutative setting used in quantum probability and information
theories. In Section 5 we discuss optimal Markov transition kernels (conditional probabilities)
in the classical (commutative) setting, which is done for simplicity. We shall recall several
facts about transition kernels, information capacity of memoryless channels they represent and 
the corresponding variational problems. The main result of this section is a theorem separat- 
ing deterministic and non-deterministic kernels. We show how mutual absolute continuity of 
optimal Markov transition kernels implies that optimal transitions are non-deterministic; de- 
terministic transitions are strictly suboptimal if information, understood broadly here, is con- 
strained. This result will be illustrated by an example, where any deterministic kernel either 
has a negatively infinite expected utility (unbounded expected error) or communicates infinite 
information; a non-deterministic kernel, on the other hand, can have both finite expected
utility and finite information. At the end of the section we shall consider applications of this
work to theories of algorithms and computational complexity. We shall discuss how deterministic
and non-deterministic algorithms can be represented by Markov transition kernels between the
space of inputs and the space of output sequences, and how constraints on the expected utility
or complexity of the algorithms are related to variational problems studied in this work. The
paper concludes with a summary and discussion of the results.

2 Preliminaries 

This work is based on a generalization of classical variational problems of information theory
and statistical physics, which can be formulated as follows. Let $(\Omega, \mathcal{A})$ be a measurable
set and let $\mathcal{P}(\Omega)$ be the set of all Radon probability measures on $\Omega$. We denote by
$\mathbb{E}_p\{x\}$ the expected value of random variable $x : \Omega \to \mathbb{R}$ with respect to
$p \in \mathcal{P}(\Omega)$. An information distance is a function
$I : \mathcal{P} \times \mathcal{P} \to \mathbb{R} \cup \{\infty\}$ that is closed (lower semicontinuous) in each
argument. An important example is the Kullback-Leibler divergence
$I_{KL}(p, q) := \mathbb{E}_p\{\ln(p/q)\}$ [23]. Recall that $\mathbb{E}_p\{x\}$ is linear in $p$, and
$I_{KL}(p, q)$ is convex. The variational problem is formulated as follows:

$$\text{maximize (minimize)}\ \ \mathbb{E}_p\{x\} \quad \text{subject to} \quad \mathbb{E}_p\{\ln(p/q)\} \leq \lambda \tag{2}$$

where optimization is over probability measures $p \in \mathcal{P}$. This problem can be considered as
linear programming with an infinite number of linear constraints, and it can be formulated as
the following convex programming problem:

$$\text{minimize}\ \ \mathbb{E}_p\{\ln(p/q)\} \quad \text{subject to} \quad \mathbb{E}_p\{x\} \geq \upsilon \quad (\mathbb{E}_p\{x\} \leq \upsilon) \tag{3}$$

Figure 1 illustrates these variational problems on a 2-simplex of probability measures over a
set of three elements with the uniform distribution $q(\omega) = 1/3$ as the reference measure.




Figure 1: 2-Simplex $\mathcal{P}$ of probability measures over set $\Omega = \{\omega_1, \omega_2, \omega_3\}$ with level
sets of expected utility $\mathbb{E}_p\{x\} = \upsilon$ and the Kullback-Leibler divergence
$\mathbb{E}_p\{\ln(p/q)\} = \lambda$. Probability measure $p_\beta$ is the solution to variational problems (2)
and (3). The family $\{p_\beta\}_x$ of solutions, shown by dashed curve, belongs to the interior of
$\mathcal{P}$.

In optimization and information theories, $\mathbb{E}_p\{x\}$ represents expected utility to be
maximized or expected cost to be minimized. In physics, it represents internal energy. Information
distance $I_{KL}(p, q)$ is also called relative entropy, and the inequality $I_{KL}(p, q) \leq \lambda$
represents an information constraint. Depending on the domain of definition of the probability
measures, the information constraint may have different meanings, such as a lower bound on
entropy (i.e. irreducible uncertainty), partial observability of a random variable, a constraint
on the amount of statistical information (i.e. a number of independent tests, questions or bits
of information), on communication capacity of a channel, on memory of a computational device and
so on [35]. These variational problems can also be formulated in quantum physics, where $x$ is an
element of a non-commutative algebra of observables, and $p$, $q$ are quantum probabilities (states).

As is well-known, solutions to problems (2) and (3) are elements of an exponential family
of probability distributions. Before we define an appropriate generalization of these problems,
we recall some axiomatic principles underpinning the choice of functionals.
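To make problem (2) concrete, the sketch below solves it on a three-point set with the uniform reference measure, as in the setting of Figure 1. It relies on the fact stated above that solutions are members of the exponential family $p_\beta \propto e^{\beta x} q$; the search for $\beta$ by bisection, the utility values and the constraint level `lam` are assumptions made for this illustration and are not part of the paper.

```python
import math

def gibbs(x, q, beta):
    """Exponential-family member p_beta proportional to exp(beta * x) * q."""
    weights = [qi * math.exp(beta * xi) for xi, qi in zip(x, q)]
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    """Kullback-Leibler divergence E_p{ln(p/q)} on a finite set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def expected(x, p):
    """Expected value E_p{x}."""
    return sum(xi * pi for xi, pi in zip(x, p))

def solve_max_utility(x, q, lam, beta_hi=50.0, tol=1e-12):
    """Find beta >= 0 with KL(p_beta, q) = lam by bisection.

    KL(p_beta, q) is non-decreasing in beta, so bisection applies whenever
    lam lies between 0 and the divergence reached at beta_hi.
    """
    lo, hi = 0.0, beta_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl(gibbs(x, q, mid), q) < lam:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    return beta, gibbs(x, q, beta)

# Hypothetical utility on three points, uniform reference measure, constraint level lam.
x = [0.0, 1.0, 3.0]
q = [1 / 3, 1 / 3, 1 / 3]
lam = 0.2
beta, p = solve_max_utility(x, q, lam)
```

The returned measure `p` meets the information constraint with equality and has strictly positive mass on every point, so it lies in the interior of the simplex.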

2.1 Axioms behind the choice of functionals 

The choice of linear objective functional $\mathbb{E}_p\{x\}$ has axiomatic foundation in game theory [?],
where $\Omega$ is equipped with total pre-order $\lesssim$, called the preference relation, and function
$x : \Omega \to \mathbb{R}$ is its utility representation: $\omega_1 \lesssim \omega_2$ if and only if
$x(\omega_1) \leq x(\omega_2)$. Because the quotient set $\Omega/\!\sim$ of a pre-ordered set with a utility
function is isomorphic to a subset of the real line, it is separable and metrizable by
$\rho([a], [b]) = |x(a) - x(b)|$, and therefore every probability measure on the completion of
$\Omega/\!\sim$ is Radon (e.g. by Ulam's theorem for probability measures on Polish spaces).

The set $\mathcal{P}(\Omega)$ of all classical probability measures on $\Omega$ is a simplex with Dirac measures
$\delta_\omega$ comprising the set $\mathrm{ext}\,\mathcal{P}$ of its extreme points [29]. The question that has been
discussed extensively is: How to extend pre-order $\lesssim$, which was defined on
$\Omega \equiv \mathrm{ext}\,\mathcal{P}$, to the whole $\mathcal{P}$? It was shown in [?] that linear (or affine) functional
$\mathbb{E}_p\{x\}$ is the only functional that makes the extended pre-order compatible with the vector
space structure of $Y \supset \mathcal{P}$ and Archimedean. Recall that for the corresponding pre-order
$(Y, \lesssim) \supset (\mathcal{P}, \lesssim)$ this is defined by the axioms:

1. $q \lesssim p$ implies $q + r \lesssim p + r$ and $\alpha q \lesssim \alpha p$ for all $r \in Y$ and $\alpha > 0$.

2. $nq \lesssim p$ for all $n \in \mathbb{N}$ implies $q \lesssim 0$.

In this paper we shall follow this formalism assuming that the objective functional is linear. 
We note that non-linearity may arise in certain dynamical systems, where x may change with 
time, but this will not be considered in this work, because our focus is on optimization
problems with respect to some fixed preference relation $\lesssim$ or utility $x$ on $\Omega$. A
non-commutative (quantum) analogue of a utility function was given in [?] by a Hermitian
operator $x$ on a separable Hilbert space (an observable) with its real spectrum representing a
total pre-order on its eigenstates. The principal difference with the classical theory is the
existence of incompatible
(non-commutative) utility operators. 

As mentioned earlier, information constraints may be related to different phenomena (e.g. 
uncertainty, observability, statistical data, communication capacity, memory, etc). However, 
in information theory they often have been represented by functionals, such as relative entropy
or Shannon information, which are defined using the Kullback-Leibler divergence $I_{KL}$. Its
choice is also based on a number of axioms [14, 19, 33], such as additivity:
$I_{KL}(p_1 p_2, q_1 q_2) = I_{KL}(p_1, q_1) + I_{KL}(p_2, q_2)$. In fact, this axiom is precisely the
reason why the logarithm function appears in its definition (i.e. as homomorphism between
multiplicative and additive groups of $\mathbb{R}$). There is, however, an abundance of other
information distances and metrics, such as the Hellinger distance, total variation and the
Fisher metrics. Although they often fail to have a proper statistical interpretation [12], there
has been a renewed interest in using different information distances and contrast functions in
applications to compare distributions (e.g. see [?]).
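The additivity axiom can be verified numerically for product measures. The sketch below is an illustration only; the four distributions are hypothetical. For contrast it also evaluates the total variation distance on the same product measures, and for these particular values total variation is not additive.

```python
import math
from itertools import product

def kl(p, q):
    """Kullback-Leibler divergence on a finite set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    """Total variation distance, half of the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def product_measure(a, b):
    """Product measure of a and b, listed in a fixed (row-major) order."""
    return [ai * bi for ai, bi in product(a, b)]

# Hypothetical marginals on a two-point and a three-point set.
p1, q1 = [0.2, 0.8], [0.5, 0.5]
p2, q2 = [0.1, 0.3, 0.6], [0.3, 0.3, 0.4]

p12 = product_measure(p1, p2)
q12 = product_measure(q1, q2)

kl_joint = kl(p12, q12)
kl_sum = kl(p1, q1) + kl(p2, q2)            # equals kl_joint up to rounding

tv_joint = total_variation(p12, q12)
tv_sum = total_variation(p1, q1) + total_variation(p2, q2)   # differs from tv_joint here
```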

For reasons outlined above, we shall generalize problems (2) and (3) by considering an
abstract information distance or resource, which will be used to define a subset of feasible 
solutions. In addition, we shall not restrict the problems to normalized measures, which makes 
the exposition a lot simpler. Normalization can be performed at a later stage. We now define 
an appropriate algebraic structure. 

2.2 Dual algebraic structures 

Let $X$ and $Y$ be complex linear spaces put in duality via bilinear form
$\langle \cdot, \cdot \rangle : X \times Y \to \mathbb{C}$:

$$\langle x, y \rangle = 0,\ \forall x \in X \ \Rightarrow\ y = 0, \qquad \langle x, y \rangle = 0,\ \forall y \in Y \ \Rightarrow\ x = 0$$

We denote by $X^\sharp$ the algebraic dual of $X$, by $X'$ the continuous dual of a locally convex
space $X$ and by $X^*$ the complete normed dual space of $(X, \|\cdot\|)$. The same notation applies to
dual spaces of $Y$. The results will be derived using only the facts that $X$ and $Y$ are ordered
linear spaces in duality. These spaces, however, can have richer algebraic structures, which we
briefly outline here.

Space $X$ is closed under an associative, but generally non-commutative binary operation
$\cdot : X \times X \to X$ (e.g. pointwise multiplication or matrix multiplication) and involution as a
self-inverse, antilinear map $* : X \to X$ reversing the multiplication order: $(x^* z)^* = z^* x$.
Thus, $X$ is a $*$-algebra. The set of all Hermitian elements $x = x^*$ is a real subspace of $X$,
and if every $x^* x$ has positive real spectrum, then $X$ is called a total $*$-algebra, in which the
spectrum of all Hermitian elements is real. In this case, Hermitian elements $x^* x$ form a pointed
convex cone $X_+$, generating $X = X_+ - X_+$.

The dual space $Y$ is closed under the transposed involution $* : Y \to Y$, defined by
$\langle x, y^* \rangle = \langle x^*, y \rangle^*$. It is ordered by a positive cone
$Y_+ := \{y : \langle x^* x, y \rangle \geq 0,\ \forall x \in X\}$, dual of $X_+$, and it has order unit
$y_0 \in Y_+$ (also called a reference measure), which is a strictly positive linear functional:
$\langle x^* x, y_0 \rangle > 0$ for all $x \neq 0$. If the pairing $\langle \cdot, \cdot \rangle$ has the property that for
each $z \in X$ there exists a transposed element $z' \in Y$ such that
$\langle zx, y \rangle = \langle x, z'y \rangle$, then $Y \supset X$ is a left (right) module over $X$ with respect to the
transposed left (right) action $y \mapsto z'y$ ($y \mapsto y z^{*\prime *}$) of $X$ on $Y$ such that
$(xz)' = z'x'$ and
$\langle x, y z^{*\prime *} \rangle = \langle x^*, z^{*\prime} y^* \rangle^* = \langle z^* x^*, y^* \rangle^* = \langle xz, y \rangle$ (see [?],
Appendix). In many practical cases, the pairing $\langle \cdot, \cdot \rangle$ is central (or tracial), so that the
left and right transpositions act identically on $y_0$: $z^{*\prime} y_0 = y_0 z^{\prime *}$ for all $z \in X$.
In this case, the element $z^{*\prime} y_0 = y_0 z^{\prime *} \in Y$ can be identified with a complex
conjugation of $z \in X$.

Two primary examples of a total $*$-algebra $X$, which are important in this work, are the
commutative algebra $C_c(\Omega, \mathbb{C}, \|\cdot\|_\infty)$ of continuous functions with compact support in a
locally compact topological space $\Omega$ and the non-commutative algebra
$C_c(\mathcal{H}, \mathbb{C}, \|\cdot\|_\infty)$ of compact Hermitian operators on a separable Hilbert space
$\mathcal{H}$. The corresponding examples of dual space $Y = X^*$ are the Banach space
$\mathcal{M}(\Omega, \mathbb{C}, \|\cdot\|_1)$ of complex signed Radon measures on $\Omega$ and its non-commutative
generalization $\mathcal{M}(\mathcal{H}, \mathbb{C}, \|\cdot\|_1)$. Note that these examples of algebra $X$ are
generally incomplete and contain only an approximate identity. However, by $X$ we shall
understand here an extended algebra that contains additional elements. In particular, $X$ will
contain the unit element $1 \in X$ such that $\langle 1, y \rangle = \|y\|_1$ if $y \geq 0$ (i.e. $1 \in X$
coincides on $Y_+$ with the norm $\|\cdot\|_1$, which is additive on $Y_+$). Furthermore, because
constraints in variational problems (2) or (3), or their generalizations, define a proper subset
of space $Y$, we can consider random variables represented by elements $x \in Y^\sharp$ that are
outside of the Banach space $Y^*$ (e.g. unbounded functions or operators).

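Below are three main examples of pairing $X$ and $Y$ by a sum, an integral or trace:

$$\langle x, y \rangle := \sum_{\Omega} x(\omega)\, y(\omega), \qquad \langle x, y \rangle := \int_{\Omega} x(\omega)\, dy(\omega), \qquad \langle x, y \rangle := \mathrm{tr}\{xy\} \tag{4}$$

Although the linear functionals $x(y) = \langle x, y \rangle$ are generally complex-valued, we shall assume,
without further mentioning, that $\langle \cdot, \cdot \rangle$ is evaluated on Hermitian elements $x = x^*$ and
$y = y^*$, so that $\langle x, y \rangle \in \mathbb{R}$. In particular, the expected value
$\mathbb{E}_p\{x\} = \langle x, p \rangle \in \mathbb{R}$, where $x$ is Hermitian and $p$ is positive. Thus, the expressions
'maximize (minimize) $x(y) = \langle x, y \rangle$' should be understood accordingly as maximization or
minimization of a real functional.

2.3 Generalized variational problems for measures

Normalized non-negative measures (i.e. probability measures) are elements of the set:

$$\mathcal{P} := \{y \in Y : y \geq 0,\ \langle 1, y \rangle = 1\}$$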
This is a weakly compact convex set, and therefore $\mathcal{P} = \mathrm{cl\,co\,ext}\,\mathcal{P}$ by the
Krein-Milman theorem. In the commutative case, $\mathcal{P}$ is a simplex, because each
$p \in \mathcal{P}$ is uniquely represented by extreme points $\delta \in \mathrm{ext}\,\mathcal{P}$ [29]. In
information geometry $\mathcal{P}$ is referred to as statistical manifold, and its topological
properties have been studied by defining different information distances
$I : \mathcal{P} \times \mathcal{P} \to \mathbb{R} \cup \{\infty\}$ [3, 12, 30]. We can generalize this by considering
information resource as a functional, defined for all positive or Hermitian elements.



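[Figure 2 plot: optimal values $\upsilon$ (vertical axis) against constraint values $\lambda \geq F(y)$ (horizontal axis)]

Figure 2: Optimal value functions $\upsilon = \overline{x}(\lambda)$ and $\upsilon = \underline{x}(\lambda)$. The value
$\lambda_0 = \inf F$ corresponds to $\upsilon \in [\underline{\upsilon}_0, \overline{\upsilon}_0]$. Special values
$\overline{\lambda}$, $\underline{\lambda}$ of the constraint $\lambda \geq F(y)$ correspond respectively to
optimal values $\overline{\upsilon}$ and $\underline{\upsilon}$.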

Let $F : Y \to \mathbb{R} \cup \{\infty\}$ be a closed functional, so that $F$ is finite at some $y$, and
sublevel sets $\{y : F(y) \leq \lambda\}$ are closed in the weak topology $\sigma(Y, X)$ for each $\lambda$.
Because $-\infty$ is not included in the definition of closed $F$, it is also lower-semicontinuous
[32]. We shall assume without further mentioning that the effective domain
$\mathrm{dom}\,F := \{y : F(y) < \infty\}$ has non-empty algebraic interior. In addition, if $Y$ is defined
over the field of complex numbers, we shall also assume that $\mathrm{dom}\,F$ contains only Hermitian
elements $y = y^*$ (e.g. $\mathrm{dom}\,F \subseteq Y_+$).

Variational problems (2) and (3) are generalized by considering all, not necessarily positive
or normalized measures, and by using any closed functional $F$ to define an information
resource. The optimal values achieved by solutions to these problems are defined by the
following optimal value functions:

$$\overline{x}(\lambda) := \sup\{\langle x, y \rangle : F(y) \leq \lambda\} \tag{5}$$

$$\underline{x}(\lambda) := \inf\{\langle x, y \rangle : F(y) \leq \lambda\} \tag{6}$$

$$\overline{x}^{-1}(\upsilon) := \inf\{F(y) : \langle x, y \rangle \geq \upsilon\} \tag{7}$$

$$\underline{x}^{-1}(\upsilon) := \inf\{F(y) : \langle x, y \rangle \leq \upsilon\} \tag{8}$$

We define $\overline{x}(\lambda) := -\infty$ if $\lambda < \inf F$, and $\overline{x}(\infty) := \lim \overline{x}(\lambda)$ as
$\lambda \to \infty$. Observe that $\underline{x}(\lambda) = -(\overline{-x})(\lambda)$ and
$\underline{x}^{-1}(\upsilon) = (\overline{-x})^{-1}(-\upsilon)$. Thus, it is sufficient to study only the
properties of $\overline{x}(\lambda)$. Figure 2 depicts schematically the optimal value functions
$\overline{x}(\lambda)$ and $\underline{x}(\lambda)$. It is clear from the definition that $\overline{x}(\lambda)$ is a
non-decreasing extended real function, and $\underline{x}(\lambda)$ is non-increasing. It will be shown
also in the next section that $\overline{x}(\lambda)$ is concave, and $\underline{x}(\lambda)$ is convex
(Proposition 3). Because sets $\{y : F(y) \leq \lambda\}$ may be unbalanced and unbounded, the functions
may not be reflections of each other in the sense that
$\overline{x}(\lambda) - \upsilon_0 \neq \upsilon_0 - \underline{x}(\lambda)$ for all $\upsilon_0$, and one or both functions
can be empty. The definition of the optimal value functions (5)-(8) in terms of functional
$F(y)$ of one variable, unlike information distance $I(y, y_0)$, allows for considering the case
when $\inf F$ is not achieved at any $y_0 \in Y$.
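The shape of the optimal value function (5) can be inspected numerically in the classical case. The sketch below is an illustration under stated assumptions: $F$ is taken to be the Kullback-Leibler divergence from a uniform reference measure on a three-point set (with normalization imposed), the utility values are hypothetical, and the graph of $\overline{x}(\lambda)$ is traced parametrically through the exponential family rather than by solving each constrained problem separately.

```python
import math

def gibbs(x, q, beta):
    """Exponential-family member p_beta proportional to exp(beta * x) * q."""
    weights = [qi * math.exp(beta * xi) for xi, qi in zip(x, q)]
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    """Kullback-Leibler divergence on a finite set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def value_curve(x, q, betas):
    """Points (lambda, v) = (KL(p_beta, q), E_{p_beta}{x}) on the graph of the value function."""
    curve = []
    for beta in betas:
        p = gibbs(x, q, beta)
        curve.append((kl(p, q), sum(xi * pi for xi, pi in zip(x, p))))
    return curve

def chord_slopes(curve):
    """Slopes of the chords joining consecutive points of the curve."""
    return [(v2 - v1) / (l2 - l1) for (l1, v1), (l2, v2) in zip(curve, curve[1:])]

# Hypothetical utility, uniform reference measure, and a grid of beta from 0 to 3.
x = [0.0, 1.0, 3.0]
q = [1 / 3, 1 / 3, 1 / 3]
curve = value_curve(x, q, [0.25 * k for k in range(13)])
slopes = chord_slopes(curve)
```

Along the curve both coordinates increase with `beta`, and the chord slopes decrease, which is the discrete signature of a non-decreasing concave function.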

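In addition to $\lambda_0 := \inf F$, we define two special values $\overline{\lambda}$ and
$\underline{\lambda}$ of functional $F$ as follows:

$$\overline{x}(\overline{\lambda}) := \sup\{\langle x, y \rangle : y \in \mathrm{dom}\,F\}, \qquad \underline{x}(\underline{\lambda}) := \inf\{\langle x, y \rangle : y \in \mathrm{dom}\,F\} \tag{9}$$

Thus, problems of maximization or minimization of $x(y) = \langle x, y \rangle$ subject to constraints
$F(y) \leq \overline{\lambda}$ or $F(y) \leq \underline{\lambda}$ respectively are equivalent to unconstrained
problems on $\mathrm{dom}\,F$. The corresponding optimal values are denoted
$\overline{\upsilon} = \overline{x}(\overline{\lambda})$ and $\underline{\upsilon} = \underline{x}(\underline{\lambda})$, as
shown on Figure 2. The reason for defining these values is that generally
$\overline{\lambda} < \infty$, $\underline{\lambda} < \infty$ and $\overline{\lambda} \neq \underline{\lambda}$ (see Figure 2).
Solutions to unconstrained problems may correspond to large, possibly infinite values
$\overline{\lambda}$ or $\underline{\lambda}$, and therefore they can be considered unfeasible. Subsets of
feasible solutions will be defined by constraints $F(y) \leq \lambda < \overline{\lambda}$ or
$F(y) \leq \lambda < \underline{\lambda}$.

In addition, we define the following special values:

$$\overline{\upsilon}_0 := \lim_{\lambda \downarrow \inf F} \sup\{\langle x, y \rangle : F(y) \leq \lambda\}, \qquad \underline{\upsilon}_0 := \lim_{\lambda \downarrow \inf F} \inf\{\langle x, y \rangle : F(y) \leq \lambda\} \tag{10}$$

If there exists a set $\partial F^*(0) \subseteq \mathrm{dom}\,F$ such that $\inf F = F(y_0)$ for all
$y_0 \in \partial F^*(0)$, then $\overline{\upsilon}_0 = \sup\{\langle x, y_0 \rangle : y_0 \in \partial F^*(0)\}$ and
$\underline{\upsilon}_0 = \inf\{\langle x, y_0 \rangle : y_0 \in \partial F^*(0)\}$. If $y_0$ is unique, then
$\overline{\upsilon}_0 = \underline{\upsilon}_0$; otherwise $\overline{\upsilon}_0 \geq \underline{\upsilon}_0$ (see Figure 2).
Elements $y_0 \in \partial F^*(0)$ represent trivial solutions, because they correspond to constraint
$\lambda_0 := \inf F$ in functions $\overline{x}(\lambda)$ and $\underline{x}(\lambda)$. Constraints
$\langle x, y \rangle \geq \upsilon > \overline{\upsilon}_0$ and $\langle x, y \rangle \leq \upsilon < \underline{\upsilon}_0$ in the inverse
functions $\overline{x}^{-1}(\upsilon)$ and $\underline{x}^{-1}(\upsilon)$ ensure that $F(y) > \lambda_0$, and the
solutions are non-trivial.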

2.4 Some facts about subdifferentials of dual convex functions 

In the next section, we show that solutions to the generalized variational problems with optimal
values (5)-(8), if they exist, are elements of a subdifferential of functional $F^*$, the dual of
$F$. Recall that $F^* : X \to \mathbb{R} \cup \{\infty\}$ is the Legendre-Fenchel transform of $F$:

$$F^*(x) := \sup\{\langle x, y \rangle - F(y)\}$$

and it is always closed and convex (e.g. see [32, 38]). Condition $F^{**} = F$ implies $F$ is closed
and convex. Otherwise, the epigraph of $F^{**}$ is a convex closure of the epigraph of $F$ in
$Y \times \mathbb{R}$. Closed and convex functionals are continuous on the (algebraic) interior of their
effective domains (e.g. see [25] or [32], Theorem 8), and they have the property

$$x \in \partial F(y) \iff \partial F^*(x) \ni y \tag{11}$$

where set $\partial F(y) := \{x : \langle x, z - y \rangle \leq F(z) - F(y),\ \forall z \in Y\}$ is the subdifferential of
$F$ at $y$, and its elements are called subgradients. In particular, $0 \in \partial F(y_0)$ implies
$F(y_0) \leq F(y)$ for all $y$ (i.e. $\inf F = F(y_0)$). We point out that the notions of subgradient
and subdifferential make sense even if $F$ is not convex or finite at $y$, but non-empty
$\partial F(y)$ implies $F(y) < \infty$ and $F(y) = F^{**}(y)$, $\partial F(y) = \partial F^{**}(y)$ ([32], Theorem 12).^1
Functional $F^*$ is strictly convex if and only if $\partial F^*(x) \ni y$ is injective, so that the
inverse mapping $\partial F(y) = \{x\}$ is single-valued.

^1 It is possible, however, that $F(y) < \infty$, but $\partial F(y) = \emptyset$ (e.g. see [38], Chapter 1,
Section 2.4, Example 6d).

Recall also that subdifferential $\partial F^* : X \to 2^Y$ of a convex function is an example of
monotone operator [?]:

$$\langle x_1 - x_2, y_1 - y_2 \rangle \geq 0, \qquad \forall y_i \in \partial F^*(x_i) \tag{12}$$

The inequality is strict for all $x_1 \neq x_2$ if and only if $\partial F^*(x) \ni y$ is injective (i.e.
$\partial F^*$ is strictly monotone).

Recall also that $H : Y \to \mathbb{R} \cup \{-\infty\}$ is concave if $F(y) = -H(y)$ is convex. The dual
of $H$ in concave sense is $H^*(x) := \inf\{\langle x, y \rangle - H(y)\}$. By analogy, one defines supgradient
and supdifferential of a concave function [32].
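A one-dimensional example may help to fix these notions. The sketch below is an illustration only: it takes the scalar functional $F(y) = y \ln y - y$ on $y \geq 0$ (an assumption chosen for this example), for which the Legendre-Fenchel transform is $F^*(x) = e^x$, attained at $y = e^x$. The supremum is approximated by a plain grid search, and the last function evaluates the left-hand side of the monotonicity inequality (12) for the gradient of $F^*$.

```python
import math

def F(y):
    """F(y) = y ln y - y for y > 0, extended by continuity with F(0) = 0."""
    return y * math.log(y) - y if y > 0 else 0.0

def conjugate_on_grid(x, y_max=10.0, steps=100_000):
    """Approximate F*(x) = sup_y {x * y - F(y)} by a grid search over [0, y_max].

    Returns the approximate supremum and the grid point where it is attained.
    """
    best_val, best_y = -math.inf, 0.0
    for i in range(steps + 1):
        y = y_max * i / steps
        val = x * y - F(y)
        if val > best_val:
            best_val, best_y = val, y
    return best_val, best_y

def monotone_gap(x1, x2):
    """Left-hand side of (12) with y_i = exp(x_i), the gradient of F*(x) = exp(x)."""
    return (x1 - x2) * (math.exp(x1) - math.exp(x2))

# The maximizer y = exp(x) satisfies ln y = x, which is property (11) for this F.
value_at_one, argmax_at_one = conjugate_on_grid(1.0)
```

Because $F^*(x) = e^x$ is strictly convex, `monotone_gap` is strictly positive whenever its two arguments differ.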

3 General properties of optimal solutions and the optimal value 
functions 

In this section, we apply the standard method of Lagrange multipliers to derive solutions
$y_\beta$ achieving the optimal value $\overline{x}(\lambda) = \langle x, y_\beta \rangle$. Then we shall study
existence of solutions and monotonic properties of the optimal value functions (5)-(8).

3.1 Optimality conditions 

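Proposition 1 (Necessary and sufficient optimality conditions). Element $y_\beta$ maximizes
linear functional $x(y) = \langle x, y \rangle$ on sublevel set $\{y : F(y) \leq \lambda\}$ of a closed functional
$F : Y \to \mathbb{R} \cup \{\infty\}$ if and only if the following conditions hold

$$y_\beta \in \partial F^*(\beta x), \qquad F(y_\beta) = \lambda$$

where parameter $\beta^{-1} \geq 0$ is related to $\lambda$ via $\beta^{-1} \in \partial \overline{x}(\lambda)$.

Proof. If $y_\beta$ maximizes $\langle x, y \rangle$ on sublevel set $C(\lambda) := \{y : F(y) \leq \lambda\}$, then it
belongs to the boundary of $C(\lambda)$ (because $\langle x, \cdot \rangle$ is linear and $C(\lambda)$ is closed).
Moreover, $y_\beta$ belongs also to the boundary of a convex closure of $C(\lambda)$, because this
convex closure is the intersection of all closed half-spaces containing $C(\lambda)$, including
$\{y : \langle x, y \rangle \leq \langle x, y_\beta \rangle\}$. Observe also that

$$\mathrm{cl\,co}\{y : F(y) \leq \lambda\} = \{y : F^{**}(y) \leq \lambda\}$$

and therefore solutions satisfy condition $F(y_\beta) = F^{**}(y_\beta)$ and
$\partial F(y_\beta) = \partial F^{**}(y_\beta)$ (e.g. see [32], Theorem 12). Thus, the Lagrange function for the
conditional extremum in (5) can be written in terms of $F^{**}$ as follows

$$K(y, \beta^{-1}) = \langle x, y \rangle + \beta^{-1}[\lambda - F^{**}(y)],$$

where $\beta^{-1}$ is the Lagrange multiplier for the constraint $\lambda \geq F^{**}(y)$. This Lagrange
function is concave for $\beta^{-1} \geq 0$, and therefore condition
$\partial K(y_\beta, \beta^{-1}) \ni 0$ is both necessary and sufficient for $y_\beta$ and $\beta^{-1}$ to define
its least upper bound, which gives

$$\partial_y K(y_\beta, \beta^{-1}) = x - \beta^{-1} \partial F^{**}(y_\beta) \ni 0 \ \Rightarrow\ y_\beta \in \partial F^*(\beta x)$$

$$\partial_{\beta^{-1}} K(y_\beta, \beta^{-1}) = \lambda - F^{**}(y_\beta) = 0 \ \Rightarrow\ F^{**}(y_\beta) = \lambda$$

Note that if $F \neq F^{**}$, then generally $F^{**}(y) \leq F(y)$, and condition
$F^{**}(y_\beta) = \lambda$ must be replaced by a stronger condition $F(y_\beta) = \lambda$.

Noting that $\overline{x}(\lambda) = \langle x, y_\beta \rangle + \beta^{-1}[\lambda - F(y_\beta)]$, the Lagrange multiplier is
defined by $\partial \overline{x}(\lambda) \ni \beta^{-1}$. Note that $\partial \overline{x}(\lambda) \geq 0$, because
$\overline{x}(\lambda)$ is non-decreasing, and $\beta^{-1} = 0$ if and only if $F(y) \geq \overline{\lambda}$. □

Remark 1. The inverse optimal value $\overline{x}^{-1}(\upsilon)$, defined by equation (7), is achieved by
solutions given by similar conditions. Indeed, the corresponding Lagrange function is

$$K(y, \beta) = F^{**}(y) + \beta[\upsilon - \langle x, y \rangle]$$

and the necessary and sufficient conditions are

$$y_\beta \in \partial F^*(\beta x), \qquad \langle x, y_\beta \rangle = \upsilon$$

where $\beta \geq 0$ is related to $\upsilon$ via $\beta \in \partial \overline{x}^{-1}(\upsilon)$. We note also that conditions
for optimal values $\underline{x}(\lambda) = -(\overline{-x})(\lambda)$ and
$\underline{x}^{-1}(\upsilon) = (\overline{-x})^{-1}(-\upsilon)$, defined by equations (6) and (8), are identical
to those in Proposition 1 and above with the exceptions that $\beta^{-1} \leq 0$ and $\beta \leq 0$.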
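The optimality conditions can be tested numerically in a small case. The sketch below is an illustration under stated assumptions: the space is two-dimensional, $F(y) = \sum_i (y_i \ln y_i - y_i)$ is chosen so that $\partial F^*(\beta x)$ contains the single point $y_\beta = e^{\beta x}$, and the vector `x`, the value of `beta`, the sampling box and the random seed are hypothetical. It samples feasible points at random and compares their values with $\langle x, y_\beta \rangle$, then estimates the slope of the optimal value function by a finite difference.

```python
import math
import random

def F(y):
    """F(y) = sum_i (y_i ln y_i - y_i) for non-negative y, with 0 ln 0 read as 0."""
    return sum(v * math.log(v) - v if v > 0 else 0.0 for v in y)

def pairing(x, y):
    """Bilinear pairing <x, y> given by a finite sum."""
    return sum(a * b for a, b in zip(x, y))

def y_opt(x, beta):
    """Candidate solution y_beta = exp(beta * x), the unique element of dF*(beta x) for this F."""
    return [math.exp(beta * a) for a in x]

def best_feasible_sample(x, lam, trials, seed, box=6.0):
    """Largest value <x, y> found among random points of [0, box]^n with F(y) <= lam."""
    rng = random.Random(seed)
    best = -math.inf
    for _ in range(trials):
        y = [rng.uniform(0.0, box) for _ in x]
        if F(y) <= lam:
            best = max(best, pairing(x, y))
    return best

def multiplier_estimate(x, beta, step=1e-4):
    """Finite-difference slope of the value function along the curve of solutions y_beta."""
    y1, y2 = y_opt(x, beta), y_opt(x, beta + step)
    return (pairing(x, y2) - pairing(x, y1)) / (F(y2) - F(y1))

# Hypothetical data for the illustration.
x = [1.0, -0.5]
beta = 0.8
y_star = y_opt(x, beta)
lam = F(y_star)                      # the constraint is active: F(y_beta) = lambda
optimal_value = pairing(x, y_star)
sampled_value = best_feasible_sample(x, lam, trials=20000, seed=0)
slope = multiplier_estimate(x, beta)  # close to 1 / beta
```

No sampled feasible point exceeds `optimal_value`, and the estimated slope agrees with $\beta^{-1}$ up to the discretization error of the finite difference.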

3.2 Existence of solutions 

The existence of optimal solutions in Proposition 1 is equivalent to finiteness of
$\overline{x}(\lambda)$, which depends on the properties of sublevel set
$C(\lambda) := \{y : F(y) \leq \lambda\}$ and linear functional $x(y) = \langle x, y \rangle$. Clearly, the existence
of solutions is guaranteed if $C(\lambda)$ is bounded in $(Y, \|\cdot\|)$ and $x \in Y^*$. This setting,
however, appears to be too restrictive. First, the restriction of $x$ to Banach space $Y^*$ is not
desirable in many applications. Indeed, measures are often considered as elements of a Banach
space with norm $\|\cdot\|_1$ of absolute convergence, and therefore $Y^*$ is complete with respect to
the Chebyshev (supremum) norm $\|\cdot\|_\infty$. Many objective functions, however, such as utility or
cost functions, are expressed using unbounded forms, such as polynomials, logarithms and
exponentials. Second, the sublevel sets $C(\lambda)$ are generally unbalanced (i.e. if
$I(y, y_0) \neq I(y_0, y)$ or $F(y_0 + [y - y_0]) \neq F(y_0 - [y - y_0])$), which means that
$\overline{x}(\lambda) \neq (\overline{-x})(\lambda)$, and therefore $\overline{x}(\lambda) \in \mathbb{R}$ does not imply
$(\overline{-x})(\lambda) \in \mathbb{R}$. In addition, sets $C(\lambda)$ can be unbounded in $(Y, \|\cdot\|)$ if we
allow for measures that are not necessarily normalized. In this case, finiteness of
$\overline{x}(\lambda)$ is no longer guaranteed, even if $x \in Y^*$. These considerations motivate us to
define the most general class of linear functionals $x \in Y^\sharp$ (elements of algebraic dual)
that admit optimal solutions to the generalized variational problems for measures and achieve
finite optimal values for all constraints.

Definition 1 ($F$-bounded linear functional). An element $x \in Y^\sharp$ is bounded above (below)
relative to a closed functional $F : Y \to \mathbb{R} \cup \{\infty\}$, or $F$-bounded above (below), if it
is bounded above (below) on sets $\{y : F(y) \leq \lambda\}$ for each
$\lambda \in (\lambda_0, \overline{\lambda})$ ($\lambda \in (\lambda_0, \underline{\lambda})$). We call $x \in Y^\sharp$
$F$-bounded if it is $F$-bounded above and below.

Thus, bounded linear functionals $x \in Y^*$ are $\|\cdot\|$-bounded. If $F(y) = I(y, y_0)$ is
understood as information, then we speak of information-bounded functionals. Although we do
not address topological questions in this paper, we point out that the values
$\overline{x}(\lambda)$ coincide with the values of support function
$s_{C(\lambda)}(x) := \sup\{\langle x, y \rangle : y \in C(\lambda)\}$ of set $C(\lambda)$, and it generalizes a
seminorm on $Y'$. In fact, a seminorm can be defined for $F$-bounded elements as
$\sup\{-\underline{x}(\lambda), \overline{x}(\lambda)\} = \sup\{s_{C(\lambda)}(-x), s_{C(\lambda)}(x)\}$, which means
they form a topological vector space. There are, however, elements $x \in Y^\sharp$ that are only
$F$-bounded above or below, as will be illustrated in the next example.

Example 1. Let $\Omega = \mathbb{N}$ and let $X$, $Y$ be the spaces of real sequences $\{x(n)\}$ and $\{y(n)\}$ with
pairing $\langle\cdot,\cdot\rangle$ defined by the sum (01). Let $F(y) = \langle \ln y - 1, y\rangle$ for $y > 0$, so that the gradient
$\nabla F(y) = \ln y$, and $F$ is minimized at the counting measure $y_0(n) = 1$. The optimal solutions
have the form $y_\beta = e^{\beta x}$, and the values of functions $\overline{x}(\lambda)$ and
$\underline{x}(\lambda) = -\overline{(-x)}(\lambda)$ are respectively

$$\langle x, y_\beta\rangle = \sum_{n=1}^{\infty} x(n)\, e^{\beta x(n)} \quad\text{and}\quad \langle x, y_{-\beta}\rangle = \sum_{n=1}^{\infty} x(n)\, e^{-\beta x(n)}, \qquad \beta^{-1} > 0$$

In particular, for $x(n) = -n$, the first series converges to $-e^{\beta}(e^{\beta} - 1)^{-2}$, but the second
diverges for any $\beta^{-1} > 0$. Thus, $x(n) = -n$ is $F$-bounded above, but not below. Observe also
that $x(n) = -n$ is unbounded, because $\|x\|_\infty := \sup\{|\langle x,y\rangle| : \|y\|_1 \le 1\}$ is infinite. On the other
hand, any constant sequence $x(n) = \alpha \in \mathbb{R}$ is bounded ($\|x\|_\infty = |\alpha|$), but it is not $F$-bounded
above or below.
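
The two series of Example 1 can be checked numerically. The following is a minimal sketch, not part of the original argument: the value beta = 0.5, the truncation lengths and the function names are illustrative assumptions.

```python
import math

def partial_sum(beta, sign, n_terms):
    """Partial sum over n = 1..n_terms of x(n) * exp(sign * beta * x(n)) for x(n) = -n."""
    return sum(-n * math.exp(sign * beta * (-n)) for n in range(1, n_terms + 1))

def closed_form(beta):
    """Limit of the convergent series: -e^beta / (e^beta - 1)^2."""
    return -math.exp(beta) / (math.exp(beta) - 1.0) ** 2

beta = 0.5  # illustrative value; any beta > 0 behaves the same way
upper = partial_sum(beta, +1, 200)                        # first series: converges
lower = [partial_sum(beta, -1, n) for n in (10, 20, 40)]  # second series: partial sums keep decreasing

print(upper, closed_form(beta))
print(lower)
```

The first partial sum agrees with the closed form, while the partial sums of the second series decrease without bound, which is the sense in which $x(n) = -n$ is $F$-bounded above but not below.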

The criterion for element $x \in Y^\sharp$ to be $F$-bounded above follows from the optimality
conditions, obtained in Proposition 1.

Proposition 2 (Existence of solutions). Solutions $y_\beta$ maximizing $x(y) = \langle x,y\rangle$ on sets
$\{y : F(y) \le \lambda\}$ exist for all values $\lambda \in (\lambda_0, \overline{\lambda})$ of a closed functional
$F : Y \to \mathbb{R} \cup \{\infty\}$, if there exists at least one number $\beta^{-1} > 0$ such that subdifferential
$\partial F^*(\beta x)$ is non-empty.

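Proof. The element $y_\beta \in \partial F^*(\beta x)$ maximizes $x(y) = \langle x,y\rangle$ on $\{y : F(y) \le \lambda\}$ by Proposition 1,
and if $\beta^{-1} > 0$ and $x \neq 0$, then $F(y_\beta) = \lambda \in (\lambda_0, \overline{\lambda})$. The optimal value
$\overline{x}(\lambda) \in \mathbb{R}$ is equal to

$$\langle x, y_\beta\rangle = \beta^{-1}\,[F^*(\beta x) + F(y_\beta)]$$

Note also that $F^*(\beta x) \in (\inf F^*, \sup F^*)$. Because sets $\{y : F(y) \le \lambda\}$ are closed for all $\lambda$ ($F$
is closed), the existence of a solution for one $\lambda$ implies the existence of solutions for all $\lambda$, and
they are $y_\beta \in \partial F^*(\beta x)$ enumerated by different values $\beta^{-1} > 0$. □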

Thus, element $x \in Y^\sharp$ is $F$-bounded above if $\partial F^*(\beta x)$ is non-empty at least for one $\beta^{-1} > 0$.
Geometrically, this means that $x$ can be absorbed into the convex set
$C^*(\lambda^*) := \{w : F^*(w) \le \lambda^*\}$ for some $\lambda^* \in (\inf F^*, \sup F^*)$. If $x \in Y^\sharp$ is also
$F$-bounded below, then $-x$ can be absorbed into $C^*(\lambda^*)$. Therefore, if $x \in Y^\sharp$ is $F$-bounded only
above or below, then the origin of a one-dimensional subspace $\mathbb{R}x := \{\beta x : \beta \in \mathbb{R}\}$ is not in the
interior of $\operatorname{dom} F^*$. In fact, it is well-known that if sets $C(\lambda) := \{y : F(y) \le \lambda\}$ are bounded,
then $0 \in \operatorname{Int}(\operatorname{dom} F^*)$ (see nana).

3.3 Monotonic properties 

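Proposition 3 (Monotonicity). Optimal value functions $\overline{x}(\lambda)$, $\underline{x}(\lambda)$,
$\overline{x}^{-1}(\upsilon)$ and $\underline{x}^{-1}(\upsilon)$, defined by equations (O, (0 and dS]) for a closed
$F : Y \to \mathbb{R} \cup \{\infty\}$ and $x \neq 0$, have the following properties:

1. The mapping $\lambda \mapsto \beta^{-1} \in \partial\overline{x}(\lambda)$ is non-increasing, and
$\upsilon \mapsto \beta \in \partial\overline{x}^{-1}(\upsilon)$ is non-decreasing.

2. If in addition $F^*$ is strictly convex, then these mappings are differentiable so that
$\beta^{-1} = d\overline{x}(\lambda)/d\lambda$ and $\beta = d\overline{x}^{-1}(\upsilon)/d\upsilon$.

3. $\overline{x}(\lambda)$ is concave and strictly increasing for $\lambda \in [\lambda_0, \overline{\lambda}]$.

4. $\underline{x}(\lambda)$ is convex and strictly decreasing for $\lambda \in [\lambda_0, \underline{\lambda}]$.

5. $\overline{x}^{-1}(\upsilon)$ is convex and strictly increasing for $\upsilon \in [\upsilon_0, \overline{\upsilon}]$.

6. $\underline{x}^{-1}(\upsilon)$ is convex and strictly decreasing for $\upsilon \in [\underline{\upsilon}, \upsilon_0]$.

where $\overline{\lambda}$, $\underline{\lambda}$ are defined by equations (9), and $\overline{\upsilon}$, $\underline{\upsilon}$ by equations (10).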

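Proof. 1. Let $y_{\beta_1}$, $y_{\beta_2}$ be maximizers of linear functional $x(y) = \langle x,y\rangle$ on sublevel sets
$\{y : F(y) \le \lambda\}$ with constraints $\lambda_1$, $\lambda_2$ respectively, and let $\upsilon_1 = \langle x, y_{\beta_1}\rangle$ and
$\upsilon_2 = \langle x, y_{\beta_2}\rangle$ denote the corresponding optimal values. Clearly, $\lambda_1 \le \lambda_2$ implies
$\upsilon_1 \le \upsilon_2$ by the inclusion $\{y : F(y) \le \lambda_1\} \subseteq \{y : F(y) \le \lambda_2\}$, so that the optimal value
function $\overline{x}(\lambda) = \langle x, y_\beta\rangle$ is non-decreasing. Using condition $y_\beta \in \partial F^*(\beta x)$ of Proposition 1
and monotonicity condition (12) for convex $F^*$, we have

$$\langle \beta_2 x - \beta_1 x,\, y_{\beta_2} - y_{\beta_1}\rangle = (\beta_2 - \beta_1)\langle x,\, y_{\beta_2} - y_{\beta_1}\rangle \ge 0$$

Therefore, $\upsilon_1 \le \upsilon_2$ implies $\beta_1 \le \beta_2$. This proves that $\lambda \mapsto \beta^{-1}$ is non-increasing, and
$\upsilon \mapsto \beta$ is non-decreasing.

2. Optimality condition $y_\beta \in \partial F^*(\beta x)$ is equivalent to $\beta x \in \partial F(y_\beta)$ by property (11),
and together with condition $F(y_\beta) = \lambda$ or $\langle x, y_\beta\rangle = \upsilon$ it implies that different $\beta_1 < \beta_2$
can correspond to the same $\lambda$ or $\upsilon$ if and only if $\partial F(y_\beta)$ includes both $\beta_1 x$ and $\beta_2 x$.
This implies that $F^*$ is not strictly convex on $[\beta_1 x, \beta_2 x] \subseteq \partial F(y_\beta)$. Dually, if $F^*$ is
strictly convex, then $\beta_1 \neq \beta_2$ implies $\lambda_1 \neq \lambda_2$ and $\upsilon_1 \neq \upsilon_2$, so that
$\{\beta^{-1}\} = \partial\overline{x}(\lambda)$ and $\{\beta\} = \partial\overline{x}^{-1}(\upsilon)$. In this case, monotone functions
$\overline{x}(\lambda)$ and $\overline{x}^{-1}(\upsilon)$ are differentiable.

3. Function $\overline{x}(\lambda)$ is strictly increasing on $\lambda \in [\lambda_0, \overline{\lambda}]$, because
$\partial\overline{x}(\lambda) \ni \beta^{-1} \ge 0$ and $\beta^{-1} = 0$ if and only if $\lambda \ge \overline{\lambda}$ (Proposition 1).
The mapping $\lambda \mapsto \beta^{-1} \in \partial\overline{x}(\lambda)$ is non-increasing, and therefore $\overline{x}(\lambda)$ is concave.

4. By the same reasoning as above, function $\overline{(-x)}(\lambda)$ is concave and strictly increasing for
$\lambda \in [\lambda_0, \underline{\lambda}]$. Thus, $\underline{x}(\lambda) = -\overline{(-x)}(\lambda)$ is convex and strictly decreasing.

5. Function $\overline{x}^{-1}(\upsilon)$ is strictly increasing for all $\upsilon \in [\upsilon_0, \overline{\upsilon}]$, because
$\partial\overline{x}^{-1}(\upsilon) \ni \beta \ge 0$, and $\beta = 0$ if and only if $\upsilon = \langle x, y_0\rangle \le \upsilon_0$ for any
$y_0 \in \partial F^*(0)$ ($\lambda_0 := \inf F = F(y_0)$). Moreover, the mapping $\upsilon \mapsto \beta \in \partial\overline{x}^{-1}(\upsilon)$ is
non-decreasing, and therefore $\overline{x}^{-1}(\upsilon)$ is convex.

6. Function $\underline{x}^{-1}(\upsilon)$ is the inverse of convex and strictly decreasing function $\underline{x}(\lambda)$. Thus,
$\underline{x}^{-1}(\upsilon)$ is also convex and strictly decreasing for $\upsilon \in [\underline{\upsilon}, \upsilon_0]$.

□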
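
The differentiability statement $\beta^{-1} = d\overline{x}(\lambda)/d\lambda$ can be illustrated in the classical special case. The sketch below is an illustration only and rests on assumptions not made in the proposition: a finite set, normalized probability measures, and the KL-divergence from a reference measure `q` in the role of $F$, so that the optimal family is the normalized exponential family. The particular `q`, `x` and step `h` are hypothetical.

```python
import math

def exp_family(q, x, beta):
    """Normalized exponential measure p_beta proportional to q * exp(beta * x)."""
    w = [qi * math.exp(beta * xi) for qi, xi in zip(q, x)]
    z = sum(w)
    return [wi / z for wi in w]

def kl(p, q):
    """KL divergence, playing the role of the constraint value lambda = F(p)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def expected(x, p):
    """Expected utility, playing the role of the optimal value upsilon."""
    return sum(xi * pi for xi, pi in zip(x, p))

def slope(q, x, beta, h=1e-4):
    """Central-difference estimate of d(upsilon)/d(lambda) along the family."""
    lo, hi = exp_family(q, x, beta - h), exp_family(q, x, beta + h)
    return (expected(x, hi) - expected(x, lo)) / (kl(hi, q) - kl(lo, q))

q = [0.25, 0.25, 0.25, 0.25]   # hypothetical reference measure
x = [0.0, 1.0, 2.0, 4.0]       # hypothetical utility values
betas = (0.5, 1.0, 2.0)
slopes = [slope(q, x, b) for b in betas]
print(slopes)                  # each entry should be close to 1 / beta
```

The estimated slopes are close to $1/\beta$ and decrease as $\beta$ (and hence $\lambda$) grows, in line with items 1 to 3 for this particular choice of $F$.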

We now use the facts that $X$ is ordered by a pointed convex cone $X_+$, generating $X = X_+ - X_+$,
and that $Y$ is ordered by the dual cone: $Y_+ := \{y : \langle x,y\rangle \ge 0,\ \forall x \ge 0\}$. For
example, this is the case when $X$ is a function space with the pointwise order, or if $X$ is the
space of operators on a Hilbert space with $x^*x \in X_+$.

Proposition 4 (Zero solution). Let $X$ be ordered by a generating pointed cone $X_+$, and let
$\{y_\beta\}_x$ be the family of all elements maximizing linear functional $x(y) = \langle x,y\rangle$ on sets
$\{y : F(y) \le \lambda\}$ for all values $\lambda$ of a closed functional $F : Y \to \mathbb{R} \cup \{\infty\}$. If all
$y_\beta \in \{y_\beta\}_x$ are non-negative and $y_\beta = 0$ for some $\lambda$, then

$$x = 0 \quad\text{or}\quad F(0) = \lambda_0 \quad\text{or}\quad F(0) = \overline{\lambda}$$

where $\lambda_0 := \inf F$, and $\overline{\lambda}$ is such that $\overline{x}(\overline{\lambda}) = \sup\{\langle x,y\rangle : y \in \operatorname{dom} F\}$.

Proof. Assume the opposite: $x \neq 0$ and $\lambda_0 < F(0) < \overline{\lambda}$. Then function
$\overline{x}(\lambda) = \langle x, y_\beta\rangle$ is strictly increasing (Proposition 3), and sets $\{y : F(y) < F(0)\}$ and
$\{y : F(0) < F(y)\}$ are non-empty ($F$ is closed). Thus, there exist solutions $y_1$ and $y_2$ such that

$$F(y_1) < F(0) < F(y_2) \quad\text{and}\quad \langle x, y_1\rangle < 0 < \langle x, y_2\rangle$$

Using decomposition $x = x_+ - x_-$, $x_+, x_- \in X_+$ and $y_1, y_2 \in Y_+$, we conclude that

$$\langle x_+ - x_-, y_1\rangle < 0 < \langle x_+ - x_-, y_2\rangle \quad\Longrightarrow\quad x_+ \ge x_- \ \text{and}\ x_+ \le x_-$$

This implies $x = 0$, which is a contradiction. □

4 Optimal measures 

Our interest is in the support set of optimal positive measures maximizing linear functional
$x(y) = \langle x,y\rangle$ on closed sets $\{y : F(y) \le \lambda\}$. First, we shall prove the main theorem about
mutual absolute continuity within families of optimal measures. Then we shall discuss the
underlying property of an information functional. At the end of this section, we formulate a
corollary stating that the support of a utility function or operator is contained in the support of
optimal measures.

4.1 Mutual absolute continuity of optimal measures 

Let $X$ be a *-algebra with a unit element $1 \in X$. Recall that $X$ can be associated with the algebra
$\mathcal{R}(\Omega)$ of subsets of $\Omega$ in the classical (commutative) setting, or with the algebra $\mathcal{R}(\mathcal{H})$ of
operators on a Hilbert space in the non-classical (non-commutative) setting. A subalgebra
$\mathcal{R}(E)$ of subset $E \subseteq \Omega$ or subspace $E \subseteq \mathcal{H}$ corresponds in each case to a subalgebra $M \subseteq X$,
and we shall use notation $y(M) = 0$ to denote measures that are zero on subset or subspace
$E$. The dual of subalgebra $M \subseteq X$ is the factor space $Y/M^\perp$ of equivalence classes
$[y] := \{z \in Y : y - z \in M^\perp\}$ generated by the annihilator $M^\perp := \{y : \langle x,y\rangle = 0,\ \forall x \in M\}$. Thus, the
elements of $Y/M^\perp$ correspond to measures that are equivalent on $M$, and $M^\perp = [0] \in Y/M^\perp$
is the subspace of measures $y(M) = 0$.

We shall define the restriction of functions or operators $x$ to subset or subspace $E$ as their
localization $\Pi_M x$, where $\Pi_M : X \to M$ is a positive 'super' operator (i.e. a linear operator
acting on the algebra of functions or operators) such that $\Pi_M(X) = M$ and $\Pi_M(x^*x) \ge 0$. Note
that when $X$ is a commutative algebra, one can always define $\Pi_M$ with the projection property
$\Pi_M^2 = \Pi_M$, leaving $M$ invariant. In the non-commutative case, a projection of $X$ onto $M$ exists
if and only if $M$ is invariant under the action of a modular automorphism group (see [37] for
details). More specifically, the positive operator $\Pi_M$ satisfies in this case condition
$\Pi_M(wx) = w\Pi_M(x)$ for all $w \in M$ and all $x \in X$. If in addition $\Pi_M(1) = 1$, then $\Pi_M$ is the non-commutative
generalization of conditional expectation (e.g. see [28]). Clearly, only subalgebras $M \subseteq X$
with projections have statistical or physical meaning. Note that one can always construct a
completely positive linear operator $\Pi_M$, which becomes a projection onto $M$, if $M$ has the
above mentioned property of modular automorphism invariance [1]. We shall refer to such
$\Pi_M$ as localization onto subalgebra $M$. The restriction of $F^* : X \to \mathbb{R} \cup \{\infty\}$ to $M$ is given by
$F^*(\Pi_M x)$, and the dual of $F^*(\Pi_M x)$ is defined on $Y/M^\perp$ as $F^{**}([y]) := \inf\{F^{**}(y) : y \in [y]\}$.

Theorem 1 (Mutual absolute continuity). Let $X$ be ordered by a generating pointed cone $X_+$,
and let $\{y_\beta\}_x$ be the family of all elements maximizing linear functional $x(y) = \langle x,y\rangle$ on sets
$\{y : F(y) \le \lambda\}$ for all values $\lambda$ of a closed functional $F : Y \to \mathbb{R} \cup \{\infty\}$. If all
$y_\beta \in \{y_\beta\}_x$ are non-negative and $F^*(x) := \sup\{\langle x,y\rangle - F(y)\}$ is strictly convex, then:

1. There is a subfamily $\{y^\circ_\beta\}_x \subseteq \{y_\beta\}_x$ containing $y^\circ_\beta$ for each
$\lambda \in (\lambda_0, \overline{\lambda})$, and $y^\circ_\beta$ correspond to mutually absolutely continuous positive measures.

2. If there exists element $y_0$ (resp. $\delta_x$) in $\{y_\beta\}_x$ such that $\inf F = F(y_0)$
(resp. $\sup\{\langle x,y\rangle : y \in \operatorname{dom} F\} = \langle x, \delta_x\rangle$), then $y_0$ (resp. $\delta_x$) is absolutely
continuous w.r.t. all $y^\circ_\beta$.

3. If in addition $F^{**}$ is strictly convex, then $\{y^\circ_\beta\}_x = \{y_\beta\}_x \setminus \{y_0, \delta_x\}$.

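Proof. Let $y_\beta$ be a solution for some $\lambda \in (\lambda_0, \overline{\lambda})$. Then $y_\beta \in \partial F^*(\beta x)$,
$0 < \beta^{-1} < \infty$ (Proposition 1). Let $\Pi_M : X \to M$ be a localization operator onto subalgebra $M \subseteq X$
(i.e. a completely positive linear operator that acts as a projection onto some subalgebras [1]). Then
$[y_\beta] \in \partial F^*(\beta \Pi_M x) \subseteq Y/M^\perp$. Assume that the corresponding measure $y_\beta(M) = 0$. Then
$y_\beta \in [0] \in Y/M^\perp$, where $[0] = M^\perp$, and because $[y_\beta] \ge 0$ ($y_\beta \ge 0$ and $\Pi_M$ is positive),
$[y_\beta] = [0]$ implies by Proposition 4

$$\Pi_M x = 0 \quad\text{or}\quad F^{**}([0]) = \lambda_0 \quad\text{or}\quad F^{**}([0]) = \overline{\lambda}_M$$

where $\lambda_0 := \inf F$, and $\overline{\lambda}_M \le \overline{\lambda}$ is such that
$\overline{\Pi_M x}(\overline{\lambda}_M) = \sup\{\langle \Pi_M x, [y]\rangle : [y] \in \operatorname{dom} F^{**}\}$.
Observe that non-empty $\partial F^{**}([0])$ is a singleton set, because $F^*$ (and hence $F^*(\Pi_M x)$) is
strictly convex. Therefore, the last two cases above are false, because otherwise $\partial F^{**}([0])$
would contain the intervals $[0, \beta \Pi_M x]$ or $[\beta \Pi_M x, \infty)$, $0 < \beta < \infty$. Thus, $\Pi_M x = 0$ is the only
true case. But then $\beta \Pi_M x = 0$ for all $\beta$, and therefore

$$[0] \in \partial F^*(\beta \Pi_M x), \qquad \forall \beta \in \mathbb{R}$$

In other words, for each $\lambda \in (\lambda_0, \overline{\lambda})$, there is a solution $y_\beta \in [0]$, such that the corresponding
measure $y_\beta(M) = 0$.

These measures are not mutually absolutely continuous only if there exists solution $y^\circ_\beta$ for
some $\lambda \in (\lambda_0, \overline{\lambda})$ such that the corresponding measure $y^\circ_\beta(N) = 0$ on some larger
subalgebra $N \supset M$. The subfamily $\{y^\circ_\beta\}_x \subseteq \{y_\beta\}_x$ corresponding to mutually absolutely continuous
measures for all $\lambda \in (\lambda_0, \overline{\lambda})$ is constructed by taking

$$M = \sup\{N \subseteq X : \exists\, y^\circ_\beta \in \{y_\beta\}_x,\ y^\circ_\beta(N) = 0\}$$

where supremum is with respect to ordering by inclusion.

If $\lambda_0 := \inf F$ (resp. $\overline{\upsilon} := \sup\{\langle x,y\rangle : y \in \operatorname{dom} F\}$) is attained at some $y_0$ (resp. $\delta_x$), then they
correspond to elements of $\{y_\beta\}_x$ with $\beta = 0$ (resp. $\beta^{-1} = 0$). The corresponding measures $y_0$
(resp. $\delta_x$) are absolutely continuous with respect to all $y^\circ_\beta$, because $\Pi_M x = 0$ implies
$\beta \Pi_M x = 0$ for all $\beta$.

If $F^{**}$ is strictly convex, then $\partial F^*(\beta x)$ contains a unique element $y^\circ_\beta$ for each $\beta^{-1} > 0$,
and $\{y^\circ_\beta\}_x = \{y_\beta\}_x \setminus \{y_0, \delta_x\}$. □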

Remark 2. The key condition in the proof of Theorem 1 is that the non-empty subdifferentials
$\partial F(y_\beta)$ are singleton sets, which follows immediately from injectivity of $\partial F^*$ or strict convexity
of $F^*$. If $y_\beta \in \operatorname{Int}(\operatorname{dom} F^{**})$, then $F^{**}$ is continuous at $y_\beta$ (e.g. see [25] or [32], Theorem 8),
and $\partial F^{**}(y_\beta)$ is a singleton if and only if $F^{**}$ is Gateaux differentiable at $y_\beta$ (e.g. see [38],
Chapter 2, Section 4.1). Injectivity of $\partial F^*$ can also be based on its algebraic properties. In
particular, if $\partial F^*$ is a group homomorphism, then it is injective if and only if its kernel is a
singleton set. This will be discussed at the end of Example 2 (see also [SiD.

Optimal probability measures are obtained by normalization $p_\beta := y_\beta / \|y_\beta\|_1$ of optimal
positive measures $y_\beta$. This corresponds to additional equality $\|y\|_1 = \langle 1,y\rangle = 1$ and inequality
$y \ge 0$ constraints in the optimal value functions ©-([Hi or simply to a restriction of functional
$F$ to the statistical manifold $\mathcal{P} := \{y : y \ge 0,\ \langle 1,y\rangle = 1\}$, which is the base of positive cone
$Y_+$. Optimal probability measures are solutions to generalized variational problems Q or (l3]l
with constraints on information distance $I(p,q)$ or resource $F(p)$. All mutually absolutely
continuous measures $y^\circ_\beta \in \{y_\beta\}_x$ belong to the same subspace $M^\perp \subseteq Y$, and the corresponding
probability measures $p^\circ_\beta$ belong to the interior of the base $\mathcal{P} \cap M^\perp$ of subcone
$M^\perp \cap Y_+ \subseteq Y_+$. In the classical (commutative) case, $\mathcal{P}$ is a simplex, and $\mathcal{P} \cap M^\perp$ is its facet,
which is itself a simplex.

Remark 3. If the effective domain $\operatorname{dom} F \subseteq Y$ of functional $F : Y \to \mathbb{R} \cup \{\infty\}$ is the positive
cone $Y_+$, then property $y_\beta(M) = 0$ on subalgebra $M \subseteq X$ implies $y_\beta$ is on the boundary of
$Y_+ = \operatorname{dom} F$. In this case, mutual absolute continuity of measures $y_\beta \in \partial F^*(\beta x)$ can be proved
using the fact that the image of injective subdifferential mapping $\partial F^* : X \to Y$ is the interior of
$\operatorname{dom} F$ (e.g. see m. Lemma 4). Therefore, such subgradients $y_\beta \in \partial F^*(\beta x)$ cannot be on the
boundary of $Y_+ = \operatorname{dom} F$.

The existence of optimal and mutually absolutely continuous probability measures for all
constraints $F(y) \le \lambda$ on an information resource is used in the next section to study optimality
of deterministic and non-deterministic Markov transition kernels. Theorem 1 shows that this
is related to strict convexity of $F^*$ (or injectivity of $\partial F^*$), and therefore we now discuss this
property with some examples.



4.2 Information and separation of variational problems for measures 

If $F^*$ is not strictly convex (or $\partial F^*$ is not injective), then $\partial F(y_\beta)$ may contain different
elements $x, w \in Y^\sharp$. Recall that linear functionals $x \in Y^\sharp$ are understood in classical optimization
theory as objective (e.g. utility) functions $x : \Omega \to \mathbb{R}$ representing a preference relation $\lesssim$ on
$\Omega = \operatorname{ext}\mathcal{P}$. Thus, $y_\beta$ may maximize both $x(y) = \langle x,y\rangle$ and $w(y) = \langle w,y\rangle$ on $\{y : F(y) \le \lambda\}$,
which means that $y_\beta$ solves different optimization problems. Indeed, value $\lambda = F(y_\beta)$ corresponds
to equal optimal values $\overline{x}^{-1}(\upsilon) = \overline{w}^{-1}(\upsilon)$, and value $\upsilon = \langle x,y_\beta\rangle = \langle w,y_\beta\rangle$ to equal
optimal values $\overline{x}(\lambda) = \overline{w}(\lambda)$. Therefore, if $F^*$ is not strictly convex, then elements $y_\beta \in Y$ may
not separate some optimization problems. Let us consider two examples.

Example 2 (Relative information). Let us define $I_{KL} : Y \times Y \to \mathbb{R} \cup \{\infty\}$ as follows

$$I_{KL}(y,y_0) := \begin{cases} \left\langle \ln\frac{y}{y_0}, y\right\rangle - \langle 1, y - y_0\rangle & \text{if } y > 0 \text{ and } y_0 > 0 \\ \langle 1, y_0\rangle & \text{if } y = 0 \text{ and } y_0 > 0 \\ \infty & \text{otherwise} \end{cases} \qquad (13)$$

This functional is an extension of the Kullback-Leibler divergence $\mathbb{E}_p\{\ln(p/q)\}$ to the whole
space $Y$, because $\langle 1, y - y_0\rangle = 0$ for positive measures $y$, $y_0$ with equal norms $\|\cdot\|_1$. The term
$\langle 1, y - y_0\rangle$ makes $I_{KL}(y,y_0) \ge 0$ for all elements $y$ and $y_0$ not necessarily with equal norms.
If $X$ is a commutative algebra, and the pairing $\langle\cdot,\cdot\rangle$ is defined by the sum or the integral (IDl,
then (13) reduces to the classical KL-divergence. In the non-commutative case, such as $X$
being an algebra of compact Hermitian operators and the trace pairing dH), functional (13) is
a generalization of some types of quantum information [9], which depend on the way $y y_0^{-1}$ is
defined, such as $\exp(\ln y - \ln y_0)$ or $y_0^{-1/2}\, y\, y_0^{-1/2}$.

The functional $F_{KL}(y) := I_{KL}(y,y_0)$ is closed, strictly convex and Gateaux differentiable on
$\operatorname{Int}(\operatorname{dom} F_{KL})$, and its gradient has the following convenient form:

$$\nabla F_{KL}(y) = \ln\frac{y}{y_0} = x \quad\Longleftrightarrow\quad y_0^{1/2} e^{x} y_0^{1/2} = y$$

One can define the dual functional $F^*_{KL} : X \to \mathbb{R} \cup \{\infty\}$ as follows

$$F^*_{KL}(x) := \langle 1, y_0^{1/2} e^{x} y_0^{1/2}\rangle$$

Clearly, $F^*_{KL}$ is also closed, strictly convex and Gateaux differentiable for all $x \in X$, where it is
finite. Optimal measures maximizing $x(y) = \langle x,y\rangle$ on sets $\{y : F_{KL}(y) \le \lambda\}$ belong to a
one-parameter exponential family $y_\beta := y_0^{1/2} e^{\beta x} y_0^{1/2}$, which are mutually absolutely continuous.
Such maximizing measures exist for all values $\lambda \in (\lambda_0, \overline{\lambda})$, if $x \in Y^\sharp$ is $F_{KL}$-bounded above,
and by Proposition 2 it is sufficient to show that $\partial F^*_{KL}(\beta x) \neq \emptyset$ for some $\beta^{-1} > 0$. We point
out that this property depends on the choice of element $y_0 = \nabla F^*_{KL}(0)$, minimizing $F_{KL}$.
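
For a finite set in the commutative case, functional (13) and the exponential family can be evaluated directly. The following is a minimal sketch under stated assumptions: measures are plain lists of non-negative numbers, components with $y = 0$ use the convention $0 \ln 0 = 0$ (which refines the piecewise definition above), and the names `gen_kl`, `exp_measure`, `y0` and `x` are hypothetical.

```python
import math

def gen_kl(y, y0):
    """Generalized KL divergence (13) for finite, commutative measures.

    Assumption: components with y = 0 use the convention 0 * ln 0 = 0.
    """
    if any(b <= 0 for b in y0) or any(a < 0 for a in y):
        return math.inf
    total = 0.0
    for a, b in zip(y, y0):
        if a > 0:
            total += a * math.log(a / b)
        total -= a - b          # the correction term -<1, y - y0>
    return total

def exp_measure(y0, x, beta):
    """Unnormalized exponential measure y_beta = y0 * exp(beta * x)."""
    return [b * math.exp(beta * xi) for b, xi in zip(y0, x)]

y0 = [0.5, 1.0, 2.0]        # hypothetical positive reference measure
x = [1.0, -1.0, 0.5]        # hypothetical utility
y_beta = exp_measure(y0, x, 0.7)
gradient = [math.log(a / b) for a, b in zip(y_beta, y0)]  # should equal 0.7 * x
print(gen_kl(y_beta, y0), gradient)
```

The gradient $\ln(y_\beta / y_0)$ recovers $\beta x$, every component of $y_\beta$ is strictly positive wherever $y_0$ is, and the divergence stays non-negative even though $y_\beta$ and $y_0$ have different norms.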

Recall also that $Y$ can be considered as a module over algebra $X \subseteq Y$ (Section 2.2). The
exponential mapping $\exp : X \to X \subseteq Y$ is the unique (up to the base constant) homomorphism
between the additive and multiplicative groups of algebra $X$, and it is injective, because
it has a singleton kernel $\{x : \exp(x) = y y_0^{-1} = 1\} = \{0\}$. The property
$\nabla F_{KL}(y) = \ln(y y_0^{-1}) = (\exp)^{-1}(y y_0^{-1})$ ensures that information distance $I_{KL}(y,y_0) = F_{KL}(y)$ is additive:
$I_{KL}(p_1 p_2, q_1 q_2) = I_{KL}(p_1, q_1) + I_{KL}(p_2, q_2)$ for all $p_1 p_2$, $q_1 q_2 \in \mathcal{P}$.
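
The additivity property can be verified numerically for product measures on finite sets. This is a small sketch; the distributions `p1`, `q1`, `p2`, `q2` are hypothetical, and the product measure is represented as a flat list.

```python
import math
from itertools import product

def kl(p, q):
    """Classical KL divergence between probability vectors p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def product_measure(p1, p2):
    """Product measure p1 x p2 as a flat list, in the order of itertools.product."""
    return [a * b for a, b in product(p1, p2)]

p1, q1 = [0.7, 0.3], [0.5, 0.5]
p2, q2 = [0.2, 0.5, 0.3], [0.1, 0.6, 0.3]
joint = kl(product_measure(p1, p2), product_measure(q1, q2))
separate = kl(p1, q1) + kl(p2, q2)
print(joint, separate)
```

Both numbers agree up to floating-point rounding, as the additivity identity requires.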




Figure 3: 2-Simplex $\mathcal{P}$ of probability measures over set $\Omega = \{\omega_1, \omega_2, \omega_3\}$ with level sets of
expected utilities $\mathbb{E}_p\{x\} = \mathbb{E}_p\{w\} = \upsilon$ and the total variation metric $\|p - q\|_1 = \lambda$. Probability
measure $p_\beta$ maximizes both $\mathbb{E}_p\{x\}$ and $\mathbb{E}_p\{w\}$ subject to constraint $\|p - q\|_1 \le \lambda$. The family
$\{p_\beta\}_x$ of solutions, shown by dashed line, contains elements on the boundary of $\mathcal{P}$.

Example 3 (Total variation). An example of information distance that does not have a strictly
convex dual is the total variation metric:

$$I_V(y,y_0) := \|y - y_0\|_1$$

Functional $F_V(y) := I_V(y,y_0)$ is not Gateaux differentiable at $y = y_0$, as well as $y$ such that
$y - y_0 \in [0] \in Y/M^\perp$, if subalgebra $M \subseteq X$ bounds $X_+$ (e.g. if $M$ contains an extreme ray of
$X_+$). Optimal solutions $y_\beta$ maximizing $x(y) = \langle x,y\rangle$ on sets $C(\lambda) := \{y : \|y - y_0\|_1 \le \lambda\}$ are
extreme points of $C(\lambda)$, and they maximize different, not necessarily proportional linear
functionals. Figure 3 illustrates the variational problems on a 2-simplex of probability measures
over a set of three elements with the uniform distribution $q(\omega) = 1/3$ as the reference measure
(compare with Figure 1). Distribution $p_\beta$ maximizes both $\mathbb{E}_p\{x\} = \langle x,p\rangle$ and $\mathbb{E}_p\{w\} = \langle w,p\rangle$
on $C(\lambda) := \{p : \|p - q\|_1 \le \lambda\}$.

The dual of $F_V$ is functional $F^*_V(x) = \chi_{C^\circ(\lambda)}(x) + \langle x, y_0\rangle$, where $\chi_{C^\circ(\lambda)}(x)$ is the indicator
function of set $C^\circ(\lambda) = \{\beta x : \|\beta x\|_\infty \le 1\}$, the polar of set $C_0(\lambda) = C(\lambda) - y_0$. Clearly, $F^*_V(x)$
is not strictly convex. Therefore, $\partial F_V(y_\beta)$ may include multiple elements, and the family
$\{y_\beta\}_x$ may contain measures that are not mutually absolutely continuous. Figure 3 shows that
the family $\{p_\beta\}_x$ of optimal solutions contains elements on the boundary of the 2-simplex $\mathcal{P}$.

In the commutative case, elements of $\partial F_V(y_\beta) \subseteq X$ are understood as utility functions,
representing preference relations $\lesssim$ on $\Omega = \operatorname{ext}\mathcal{P}$. If $\partial F_V(y_\beta)$ includes functions $x$ and $w$,
then they attain their suprema $\sup x(\omega) = x(\top) = \|x\|_\infty$ and $\sup w(\omega) = w(\top) = \|w\|_\infty$ on the
set of the same elements $\top \in \Omega$. However, the utility functions $x(\omega)$ and $w(\omega)$ may represent
different preference relations $\lesssim$ on $\Omega$. Note also that the suprema $x(\top)$ or $w(\top)$ of utilities may
never be achieved or observed in problems with constraints on information, even if $x$ or $w$ are
bounded functions. The values of utilities on elements $\omega \neq \top$ are important for maximization
of the expected utility.
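
The behaviour described in Example 3 can be reproduced on a three-element set. The sketch below is an illustration under assumptions: the utilities `x` and `w` are hypothetical, have distinct values, and share the same ordering of elements; the greedy rule (move mass $\lambda/2$ from the lowest-utility elements to the highest-utility element) is the standard solution of this small linear program and is used here in place of a general solver.

```python
def tv_optimal(q, x, lam):
    """Maximize <x, p> over the simplex subject to ||p - q||_1 <= lam.

    Assumes the values of x are distinct. Moves mass lam / 2 from the
    lowest-utility elements to the highest-utility element.
    """
    p = list(q)
    best = max(range(len(x)), key=lambda i: x[i])
    budget = min(lam / 2.0, 1.0 - p[best])
    for i in sorted(range(len(x)), key=lambda i: x[i]):
        if i == best or budget <= 0:
            continue
        moved = min(p[i], budget)
        p[i] -= moved
        p[best] += moved
        budget -= moved
    return p

q = [1.0 / 3.0] * 3          # uniform reference measure, as in Figure 3
x = [0.0, 1.0, 3.0]          # hypothetical utility
w = [0.0, 2.0, 2.5]          # a different, non-proportional utility
p_small_x = tv_optimal(q, x, 0.4)
p_small_w = tv_optimal(q, w, 0.4)
p_large = tv_optimal(q, x, 2.0 / 3.0)
print(p_small_x, p_small_w, p_large)
```

The same measure is optimal for both `x` and `w`, so it does not separate the two problems, and for the larger constraint the solution has a zero component, that is, it lies on the boundary of the simplex and is not mutually absolutely continuous with `q`.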

As was discussed in Section 2.1, information is often required to satisfy the additivity
axiom, which is why information-theoretic definitions of entropy and mutual information are
based on the KL-divergence $I_{KL}(y,y_0)$, and it has a strictly convex dual. Strict convexity
of the dual functional is a weaker condition than the additivity axiom, but it ensures that
each probability measure $p \in \mathcal{P}$ is an optimal solution to a unique variational problem with
an abstract information resource F, generalizing problems ^ or Note also that strict 
convexity of $F^*$ ensures that information resource $F$ has a directional derivative at each
$y \in \operatorname{Int}(\operatorname{dom} F)$ (e.g. $p \in \operatorname{Int}(\mathcal{P})$), which facilitates convergence of measures in problems with
dynamic information. Thus, strict convexity of the dual functional appears to be a natural
requirement on the functional representing information.

4.3 Support of utility functions and operators 

We now conclude this section with the following corollary about the support of utility functions
or operators. Recall that the support of function $x : \Omega \to \mathbb{R}$ is the set
$\operatorname{supp}(x) := \{\omega : x(\omega) \neq 0\}$. The support of an operator $x$ on a Hilbert space is defined as a projection onto
the orthogonal complement of its kernel (e.g. [15], Appendix III). When $x$ is considered as
an element of algebra $X$, its restriction to a subset $E \subseteq \Omega$ (subspace $E \subseteq \mathcal{H}$) is given by
localization $\Pi_M x$ of $x$ onto subalgebra $M \subseteq X$ corresponding to $E$. Thus, the support of $x$ can
be identified with the complement of the largest subalgebra $M \subseteq X$ such that $\Pi_M x = 0$.

Corollary 1 (Support). Under the assumptions of Theorem 1, the support of element $x \in X$ is
a subset of the support of optimal measures $y_\beta$ for all $\lambda \in (\lambda_0, \overline{\lambda})$.

Proof. During the proof of Theorem 1 we established under its assumptions that if solution
$y_\beta(M) = 0$ for some $\lambda \in (\lambda_0, \overline{\lambda})$ and $M \subseteq X$, then the localization $\Pi_M x = 0$. Dually, if
$\Pi_M x \neq 0$ for some $M \subseteq X$, then $y_\beta(M) \neq 0$ for all such $y_\beta$. □

Because random variables or observables are considered with respect to normalized positive
measures (i.e. probability measures), they can be treated not as elements of algebra $X$,
dual of $Y$, but as elements of the factor space $X/\mathbb{R}1$, generated by subspace
$\mathbb{R}1 := \{\beta 1 : \beta \in \mathbb{R},\ 1 \in X\}$ of scalar vectors. Indeed, statistical manifold $\mathcal{P}$ is a subset of the affine set
$\{y : \langle 1,y\rangle = 1\} = \{1\}^\perp + q$, where $\{1\}^\perp$ is the annihilator of element $1 \in X$, and $q \in \mathcal{P}$.
Thus, every probability measure $p \in \mathcal{P}$ is equivalently represented by elements $y \in \{1\}^\perp$
as $p = y + q$. The dual of subspace $\{1\}^\perp$ is the factor space $X/\mathbb{R}1$, and random variables
are affine sets $[x] = \mathbb{R}1 + x$ corresponding to equivalence classes $[x] = \{w : x - w \in \mathbb{R}1\}$ and
$\langle x - w, p - q\rangle = 0$ for any $p, q \in \mathcal{P}$. Observe now that $\mathbb{R}1$ is the zero element in $X/\mathbb{R}1$, and
therefore $\Pi_M x \notin \mathbb{R}1$ implies $p_\beta(M) > 0$ for all optimal probability
measures (Corollary 1). Dually, $p_\beta(M) = 0$ implies that $\Pi_M x \in \mathbb{R}1$. In the language of classical
probability this can be stated as follows: if $x(\omega_1) \neq x(\omega_2)$ for some $\omega_1, \omega_2 \in E \subseteq \Omega$,
then $p_\beta(E) > 0$ for all probability measures maximizing $\mathbb{E}_p\{x\}$ on sets $\{p : F(p) \le \lambda\}$ for all
$\lambda \in (\lambda_0, \overline{\lambda})$. Dually, $p_\beta(E) = 0$ implies that $x(\omega) = \mathrm{const}$ for all $\omega \in E$.

5 Optimal Markov transition kernels 

In this section, we consider a composite system, such as a direct product $\Omega = A \times B$ of two sets,
and the problem of optimization of transitions between the elements of $A$ and $B$. Such problems
appear in theories of decisions, control, communication and computation, where components
of a system (represented by sets $A$, $B$, etc.) may have different meanings, but the main
objective is to find transitions between the elements of $A$ and $B$ that are optimal with respect
to a utility function $x : A \times B \to \mathbb{R}$. In some cases, optimal transitions are deterministic,
corresponding to some functions $a = f(b)$ or $b \in f^{-1}(a)$. More generally, non-deterministic
transitions are represented by conditional probabilities or Markov transition kernels. For simplicity,
our exposition will be in the classical setting of commutative algebra $X := C_c(\Omega, \mathbb{R}, \|\cdot\|_\infty)$ of
functions on $\Omega = A \times B$. This is because joint and conditional probabilities are well-defined
and understood in this setting. In the non-classical case, the analogue of a conditional probability
operator can also be defined (e.g. [1, 28, 37]), and the results of this section can then be
transferred to this setting. However, this leads to unnecessary complications, which we shall
avoid.

5.1 Markov transition kernels and information constraints 

Let us recall the following definition (e.g. see [12], Sections 2 and 5).

Definition 2 (Markov transition kernel). Given two measurable sets $(A, \mathcal{A})$ and $(B, \mathcal{B})$, a
Markov transition kernel is a conditional probability measure $P(A_i \mid b) \in \mathcal{P}(A)$ on $(A, \mathcal{A})$,
which is $\mathcal{B}$-measurable for each $A_i \in \mathcal{A}$.

Markov transition kernel defines linear transformation $\Pi : \mathcal{P}(B) \to \mathcal{P}(A)$ between statistical
manifolds $\mathcal{P}(A)$ and $\mathcal{P}(B)$ as follows:

$$P(A_i) = \Pi P(B_j) := \int_B P(A_i \mid b)\, dP(b)$$

Elements $p \in \mathcal{P}(A \times B)$ are joint probability measures $P(A_i \times B_j) = P(A_i \mid B_j)P(B_j)$, and for
$P(B_j) > 0$, the conditional probability is defined by the Bayes formula:

$$P(A_i \mid B_j) := \frac{P(A_i \times B_j)}{P(B_j)}$$

Event $a \in A$ is statistically independent of $b \in B$ if and only if $P(A_i \mid b) = P(A_i)$ for each $b \in B$
and all $A_i \in \mathcal{A}$. In this case, $P(A_i \times B_j) = P(A_i)P(B_j)$. On the other hand, a function $a = f(b)$
defines deterministic dependency of $a$ on $b$, and it corresponds to a deterministic transition
kernel

$$P(A_i \mid b) = \delta_{f(b)}(A_i) := \begin{cases} 1 & \text{if } f(b) \in A_i \\ 0 & \text{otherwise} \end{cases}$$


One can see that each joint probability measure $p \in \mathcal{P}(A \times B)$ defines a pair of marginal
and conditional probability measures $P(B)$ and $P(A \mid B)$ or $P(A)$ and $P(B \mid A)$. Thus, points
of $\mathcal{P}(A \times B)$ define all possible transition kernels, including all possible measurable functions
between $A$ and $B$. Hence the following classification.

Definition 3 (Deterministic composite state). A joint probability measure $p \in \mathcal{P}(A \times B)$ is
deterministic, if and only if it defines a deterministic transition kernel $\delta_{f(b)}(A_i)$ for some
measurable function $f : B \to A$ or $f^{-1} : A \to B$. Otherwise, $p$ is non-deterministic.
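
For finite sets, Definition 3 amounts to a simple test on the matrix of joint probabilities. The following is a sketch under assumptions: `joint[i][j]` stands for $P(a_i, b_j)$, inputs of zero probability are ignored (the function is unconstrained there), and the helper names are hypothetical.

```python
def columns_deterministic(joint):
    """True if every column has at most one strictly positive entry."""
    n_rows, n_cols = len(joint), len(joint[0])
    for j in range(n_cols):
        support = [i for i in range(n_rows) if joint[i][j] > 0]
        if len(support) > 1:
            return False
    return True

def is_deterministic(joint):
    """joint[i][j] = P(a_i, b_j); deterministic if a = f(b) or b = f^{-1}(a)."""
    transposed = [list(row) for row in zip(*joint)]
    return columns_deterministic(joint) or columns_deterministic(transposed)

print(is_deterministic([[0.5, 0.0], [0.0, 0.5]]))      # bijection between A and B
print(is_deterministic([[0.3, 0.7], [0.0, 0.0]]))      # constant function f(b) = a_0
print(is_deterministic([[0.25, 0.25], [0.25, 0.25]]))  # independent a and b
```

Every matrix that passes the test has zero entries, which reflects the observation that deterministic joint states lie on the boundary of the simplex of joint measures.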

Transition kernels are often understood as communication channels, giving a more traditional
meaning to the notion of information related to the process of sending messages between
$A$ and $B$. The amount of information communicated by $P(A_i \mid b)$ is measured by the Shannon
mutual information [33]:

$$I_S\{a,b\} := \int_{A \times B} \left[\ln\frac{dP(a,b)}{dP(a)\,dP(b)}\right] dP(a,b) = \int_B dP(b) \int_A \left[\ln\frac{dP(a \mid b)}{dP(a)}\right] dP(a \mid b) \qquad (14)$$

One can see that $I_S\{a,b\}$ is defined as information distance $I_{KL}(p,q) := \mathbb{E}_p\{\ln(p/q)\}$ of joint
measure $p := P(A_i \times B_j)$ from the product of marginals $q := P(A_i)P(B_j)$, or as the expectation
of the information distance $I_{KL}$ of the conditional probability $P(A_i \mid b)$ from the marginal $P(A_i)$,
taken with respect to a fixed marginal $P(B_j)$.
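
For finite sets the two expressions in (14) become finite sums. The sketch below evaluates both; the matrix convention `joint[i][j]` for $P(a_i, b_j)$ and the function names are assumptions made for illustration.

```python
import math

def marginals(joint):
    """Return (P(a), P(b)) for joint[i][j] = P(a_i, b_j)."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return pa, pb

def mutual_information(joint):
    """First form of (14): KL divergence of the joint from the product of marginals."""
    pa, pb = marginals(joint)
    return sum(p * math.log(p / (pa[i] * pb[j]))
               for i, row in enumerate(joint) for j, p in enumerate(row) if p > 0)

def mutual_information_conditional(joint):
    """Second form of (14): expected KL divergence of P(a | b) from P(a)."""
    pa, pb = marginals(joint)
    total = 0.0
    for j, pbj in enumerate(pb):
        if pbj == 0:
            continue
        for i, row in enumerate(joint):
            cond = row[j] / pbj
            if cond > 0:
                total += pbj * cond * math.log(cond / pa[i])
    return total

noisy = [[0.4, 0.1], [0.1, 0.4]]
print(mutual_information(noisy), mutual_information_conditional(noisy))
```

The two forms agree; an independent joint measure gives zero, and a bijection with uniform input on two elements gives $\ln 2$, the entropy of the input.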

Variational problems (O and ^ for composite systems and constraints on mutual information
have been studied in information theory (e.g. [33, 34, 35]). Note that when problems Q
and ([3]) are considered on any measurable set $\Omega$, they are referred to in information theory
as problems of the first kind [35]. For a composite system $\Omega = A \times B$, one distinguishes
between problems of the second and third kind. Observe that the amount of mutual information
(14) communicated depends on $P(B_j)$, which we refer to as the input or source
distribution, and transition probabilities $P(A_i \mid b)$. In fact, $I_S\{a,b\} = H\{b\} - H\{b \mid a\}$, where
$H\{b\} := \mathbb{E}_p\{-\ln P(b)\}$ is the entropy of $P(B)$, and $H\{b \mid a\}$ is the conditional entropy.
Optimization problems over input distributions $P(B)$ and with a fixed channel $P(A_i \mid b)$ are
problems of the second kind. Problems of the third kind are concerned with finding an optimal
channel for a fixed set of input distributions. The results of previous sections allow us to consider
a generalization of these problems when mutual information is defined by some other
information distance $I(p,q)$ between two joint states $p, q \in \mathcal{P}(A \times B)$ or an information
resource $F(p)$. Note that problems of the third kind play an important role not only in information
theory, but also in other areas including optimal statistical decisions, estimation, control and
even in the theory of algorithms, as will be illustrated in Section [ 



5.2 Strict sub-optimality of deterministic kernels 

Observe that $P_f(A_i \times B_j) = \delta_{f(b)}(A_i)P(B_j) = 0$ for all $f(b) \notin A_i$. Thus, deterministic transition
kernels can be defined only by joint states that are on the boundary of $\mathcal{P}(A \times B)$; interior
points of $\mathcal{P}(A \times B)$ can define only non-deterministic transition kernels. The application of
Theorem 1 to the case $\Omega = A \times B$ yields the following result.

Theorem 2 (Separation of deterministic and non-deterministic kernels). Let
$\{p_\beta\}_x \subseteq \mathcal{P}(A \times B)$ be a family of joint probability measures maximizing expected value
$\mathbb{E}_p\{x\} = \langle x,p\rangle$ of function $x : A \times B \to \mathbb{R}$ on sets $\{p : F(p) \le \lambda\}$ for all values $\lambda$ of a closed functional
$F : \mathcal{P} \to \mathbb{R} \cup \{\infty\}$. If $F^*(x) := \sup\{\langle x,p\rangle - F(p)\}$ is strictly convex and $F$ is minimized at
$p_0 \in \partial F^*(0) \subseteq \operatorname{Int}(\mathcal{P}(A \times B))$, then

1. $\{p_\beta\}_x$ contains deterministic $p_f$ if and only if it is a solution to an unconstrained problem:
$\lambda \ge \overline{\lambda}$ or $\langle x,p_f\rangle = \overline{\upsilon} := \overline{x}(\overline{\lambda}) = \sup\{\langle x,p\rangle : p \in \mathcal{P}(A \times B)\}$.

2. The inequality

$$\langle x,p_f\rangle < \langle x,p_\beta\rangle$$

holds for all deterministic $p_f \in \mathcal{P}(A \times B)$ such that $F(p_f) = F(p_\beta) \in (\lambda_0, \overline{\lambda})$.

3. Similarly, the inequality

$$F(p_f) > F(p_\beta)$$

holds for all deterministic $p_f \in \mathcal{P}(A \times B)$ such that $\langle x,p_f\rangle = \langle x,p_\beta\rangle \in (\upsilon_0, \overline{\upsilon})$.

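Proof. 1. ($\Rightarrow$) Assume there exists $p_f \in \{p_\beta\}_x$ for $\lambda < \overline{\lambda}$ (and $\langle x,p_f\rangle < \overline{\upsilon}$), and such that
the corresponding transition kernel is deterministic: $P_f(A_i \mid B_j) = 1$ if $A_i = f(B_j)$ and
$P_f(A \setminus A_i \mid B_j) = 0$. In this case, $p_f := P_f(A \times B)$ is not in the interior of $\mathcal{P}(A \times B)$,
because $P_f((A \setminus f(B_j)) \times B_j) = 0$, and in particular $p_f$ does not minimize $F$, because
$\partial F^*(0) \subseteq \operatorname{Int}(\mathcal{P}(A \times B))$ by our assumption. Thus, $F(p_f) = \lambda \in (\lambda_0, \overline{\lambda})$. But then
$P_f((A \setminus f(B_j)) \times B_j) = 0$ implies that there exist $p^\circ_\beta \in \{p_\beta\}_x$ for all $\lambda \in [\lambda_0, \infty]$ such
that $p^\circ_\beta := P_\beta((A \setminus f(B_j)) \times B_j) = 0$ by Theorem 1. In particular, there exists
$p_0 \in \partial F^*(0)$ such that $P_0((A \setminus f(B_j)) \times B_j) = 0$, and therefore $p_0$ is also not in the interior
of $\mathcal{P}(A \times B)$. Thus, by contradiction we have proven $p_f \notin \{p_\beta\}_x$ or $\lambda \ge \overline{\lambda}$ (and hence
$\langle x,p_f\rangle = \overline{\upsilon}$).

($\Leftarrow$) If $\lambda \ge \overline{\lambda}$, then there exists solution $\delta_x \in \operatorname{ext}\mathcal{P}(A \times B)$ such that
$\langle x,\delta_x\rangle = \overline{\upsilon} := \sup\{\langle x,p\rangle : p \in \mathcal{P}\}$ (by linearity of $\langle x,\cdot\rangle$ and Krein-Milman theorem for $\mathcal{P}$), and $\delta_x$
corresponds to some function $f(b) = a$.

2. For all $x \in X$ and $y \in Y$, the Young-Fenchel inequality holds: $\langle x,y\rangle \le F^*(x) + F(y)$.
Moreover, it holds with equality if and only if $y \in \partial F^*(x)$ (e.g. see [38], Chapter 2,
Section 4.1, Lemma 3). Assume $p_\beta \in \partial F^*(\beta x)$. Then $\langle x,p_\beta\rangle = \beta^{-1}[F^*(\beta x) + F(p_\beta)]$.
On the other hand, if $p_f$ is deterministic and $F(p_f) \le \lambda < \overline{\lambda}$, then $p_f \notin \partial F^*(\beta x)$ and
therefore

$$\langle x,p_f\rangle < \beta^{-1}[F^*(\beta x) + F(p_f)] = \beta^{-1}[F^*(\beta x) + F(p_\beta)] = \langle x,p_\beta\rangle$$

3. By definition of the Legendre-Fenchel transform, $F^{**}(y) \ge \langle x,y\rangle - F^*(x)$, and the equality
holds if and only if $x \in \partial F^{**}(y)$. Assume $\beta x \in \partial F^{**}(p_\beta)$. Then
$F^{**}(p_\beta) = F(p_\beta) = \beta\langle x,p_\beta\rangle - F^*(\beta x)$. On the other hand, if $p_f$ is deterministic and $\langle x,p_f\rangle < \overline{\upsilon}$, then
$\beta x \notin \partial F^{**}(p_f)$, and therefore

$$F(p_f) \ge F^{**}(p_f) > \beta\langle x,p_f\rangle - F^*(\beta x) = \beta\langle x,p_\beta\rangle - F^*(\beta x) = F(p_\beta)$$

Note that $\beta > 0$ and $F(p_\beta) = \lambda > \lambda_0$, if $\langle x,p_\beta\rangle = \upsilon > \upsilon_0$.

□

The assumptions of Theorem 2 are quite general. The relation of strict convexity of $F^*$
to the separating property of information in variational problems for measures was discussed in
Section 4.2. The assumption $p_0 \in \operatorname{Int}(\mathcal{P}(A \times B))$ is very natural. Indeed, each facet of the
simplex $\mathcal{P}(A \times B)$ is also a simplex of some subset of $A \times B$. Therefore, the element $p_0$ is
always in the interior of some simplex $\mathcal{P}(A_i \times B_j)$, unless $p_0 = \delta \in \operatorname{ext}\mathcal{P}(A \times B)$. In all practical
cases, information is minimized at $p_0 \notin \operatorname{ext}\mathcal{P}(A \times B)$. In particular, one often chooses
$p_0 := P(A_i)P(B_j)$, so that $a$ and $b$ are independent, and supports of marginal probabilities
$P(A_i)$ and $P(B_j)$ include more than one element.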
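
Part 3 of Theorem 2 can be illustrated on the smallest composite system. The sketch below is an illustration only and makes several assumptions that are not part of the theorem: $A = B = \{0, 1\}$, the utility is the indicator of $a = b$, the information resource $F$ is taken to be the KL-divergence from the product of uniform marginals $p_0$, and deterministic joint states are searched over a grid of input distributions. All names are hypothetical.

```python
import math
from itertools import product

def kl(p, q):
    """KL divergence, used here as the information resource F(p) = I_KL(p, p0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

STATES = list(product((0, 1), (0, 1)))                  # pairs (a, b)
UTILITY = [1.0 if a == b else 0.0 for a, b in STATES]   # x(a, b) = 1 if a = b
P0 = [0.25] * 4                                         # product of uniform marginals

def expected_utility(p):
    return sum(u * pi for u, pi in zip(UTILITY, p))

def exponential_joint(beta):
    """Optimal joint measure for the KL resource: p_beta proportional to p0 * exp(beta * x)."""
    w = [q * math.exp(beta * u) for q, u in zip(P0, UTILITY)]
    z = sum(w)
    return [wi / z for wi in w]

def deterministic_joint(f, s):
    """Joint measure of the kernel delta_{f(b)} with input distribution P(b = 0) = s."""
    pb = (s, 1.0 - s)
    return [pb[b] if f[b] == a else 0.0 for a, b in STATES]

target = 0.8                                   # expected utility in (0.5, 1)
beta = math.log(target / (1.0 - target))       # solves E{x} = target in this example
p_beta = exponential_joint(beta)

costs = []
for f in product((0, 1), repeat=2):            # all four functions f: B -> A
    for k in range(101):
        p_f = deterministic_joint(f, k / 100.0)
        if abs(expected_utility(p_f) - target) < 1e-9:
            costs.append(kl(p_f, P0))

print(kl(p_beta, P0), min(costs))
```

Every deterministic joint state that attains the same expected utility as $p_\beta$ requires strictly more information under this choice of $F$, which is the inequality $F(p_f) > F(p_\beta)$ in this special case.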

To understand better the result of Theorem 2, we now recall some facts about mutual information
for deterministic kernels and then for exponential kernels, which are an important
example of non-deterministic kernels. These facts will be used in a qualitative example,
presented later.

5.3 Deterministic transition kernels 

The probability measure $P(A_i) = \Pi_f P(B_j)$ defined by a linear transformation with deterministic transition kernel $\delta_{f(b)}(A_i)$ is sometimes denoted $Pf^{-1}(A_i) := P\{b : f(b) \in A_i\}$ (e.g. [12], Section 2). If $f : B \to A$ is injective, then $Pf^{-1}(A_i) = P(B_j)$ for each $A_i = f(B_j)$.

Definition 4 (Measurable isomorphism). An injective and measurable function $f : B \to A$ is called a measurable monomorphism of $B$. If $f$ is also surjective and $f^{-1}(a)$ is measurable, then $f$ is a measurable isomorphism.

We point out the following known result. 

Proposition 5 (Invertible transformation). A linear transformation $\Pi : \mathcal{P}(B) \to \mathcal{P}(A)$ of statistical manifolds is invertible if and only if its Markov transition kernel is $\delta_{f(b)}(A_i)$, where $f$ is a measurable isomorphism.

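Proof. ($\Rightarrow$) Assume that the transition kernel of $\Pi$ is not defined by any function. Thus, $\Pi\delta_b = p \notin \operatorname{ext}\mathcal{P}(A)$ for some $\delta_b \in \operatorname{ext}\mathcal{P}(B)$. Without loss of generality, we can assume that $p = (1-t)\delta_{a_1} + t\delta_{a_2}$ for some $t \in (0,1)$ and $\delta_{a_1}, \delta_{a_2} \in \operatorname{ext}\mathcal{P}(A)$ such that $\delta_{a_1} \neq \delta_{a_2}$. Then

$$\Pi^{-1}p = \Pi^{-1}[(1-t)\delta_{a_1} + t\delta_{a_2}] = (1-t)\Pi^{-1}\delta_{a_1} + t\Pi^{-1}\delta_{a_2}$$

Because $\delta_b \in \operatorname{ext}\mathcal{P}(B)$ is not a convex combination of any points of $\mathcal{P}(B)$, it implies $\Pi^{-1}\delta_{a_1} = \Pi^{-1}\delta_{a_2} = \delta_b$. But then $\Pi^{-1}$ is not injective, because $\delta_{a_1} \neq \delta_{a_2}$, and therefore $\Pi$ is not surjective. Thus, the transition kernel of an invertible $\Pi$ must be $\delta_{f(b)}(A_i)$ for some measurable function $f : B \to A$. Clearly, such $\Pi$ is invertible only if the mapping $f : \operatorname{ext}\mathcal{P}(B) \to \operatorname{ext}\mathcal{P}(A)$ is injective, surjective, and both $f$ and $f^{-1}$ are measurable.

($\Leftarrow$) Obvious. □

Let us consider information communicated by a deterministic transition kernel $\delta_{f(b)}(A_i)$. The maximum (or supremum) amount of information can be communicated if $f$ is an injective function, because the preimage $f^{-1}(a)$ uniquely determines $b$. If a function is not injective, then $b \in f^{-1}(a)$ is determined up to the probability $1/|f^{-1}(a)|$. Indeed, for countable $B$ and constant $P(b)$ (see the footnote) this can be shown as follows:

$$P_f(b \mid a) = \frac{P_f(a,b)}{P_f(a)} = \frac{\delta_{f(b)}(a)\,P(b)}{\sum_{b \in f^{-1}(a)} P(b)} = \frac{1 \cdot P(b)}{|f^{-1}(a)|\,P(b)} = \frac{1}{|f^{-1}(a)|}$$
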
Footnote: The condition $P(b) = \mathrm{const}$ was omitted in the final version.

We can express the average amount of information communicated by a function $f$ by the following injectivity index of $f$:

$$I(f) := \frac{1}{\mathbb{E}\{|f^{-1}(a)|\}}$$

Note that if $B$ is finite, then we can compute the injectivity index as $I(f) = |f(B)|/|B|$. Indeed, $\sum_{a \in f(B)} |f^{-1}(a)| = |B|$, and so the average value of $|f^{-1}(a)|$ is $|B|/|f(B)|$. Thus, $I(f) = 1$ for an injective function, and $\inf I(f) = 0$, corresponding to an empty function. For constant functions, $I(f) = 1/|B|$, and they communicate the least amount of information among non-empty functions. If $B$ is finite, then $I(f) < 1$ implies $|f(B)| < |B|$. This is not the case, however, for functions defined on an infinite set (e.g. $I(f) = 1/2$ for $f : \mathbb{Z} \to \mathbb{N}$ defined as $f(b) = |b|$, but $|f(B)| = |B| = \aleph_0$). Let us show that if the image of a function is infinite, then one can always construct an input distribution $P(B)$ such that the output distribution $Pf^{-1}(A)$ has infinite entropy.
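
For a finite set $B$ the injectivity index and the posterior $P_f(b \mid a) = 1/|f^{-1}(a)|$ under a constant $P(b)$ can be computed directly. The following is a minimal sketch with hypothetical helper names, using exact rational arithmetic.

```python
from collections import Counter
from fractions import Fraction

def injectivity_index(f, domain):
    """I(f) = |f(B)| / |B| for a function f on a finite, non-empty domain B."""
    domain = list(domain)
    image = {f(b) for b in domain}
    return Fraction(len(image), len(domain))

def posterior_given_output(f, domain):
    """P_f(b | a) for uniform P(b): every b in f^{-1}(a) gets 1 / |f^{-1}(a)|."""
    domain = list(domain)
    preimage_sizes = Counter(f(b) for b in domain)
    return {b: Fraction(1, preimage_sizes[f(b)]) for b in domain}

B = range(-3, 4)                                      # the finite set {-3, ..., 3}
index_identity = injectivity_index(lambda b: b, B)    # injective function
index_constant = injectivity_index(lambda b: 0, B)    # constant function
index_abs = injectivity_index(abs, B)                 # two-to-one except at 0
posterior_abs = posterior_given_output(abs, B)
```

Here the identity gives index 1, the constant function gives $1/|B|$, and $f(b) = |b|$ on this truncated domain gives $4/7$, which approaches the value $1/2$ quoted for $\mathbb{Z}$ as the domain grows.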

Proposition 6 (Maximizing input distribution). Let $(A, \mathcal{A})$ and $(B, \mathcal{B})$ be infinite measurable sets, and let $\{f_n\}$ be a sequence of measurable functions $f_n : B \to A$ with finite images. There exists a sequence of probability measures $P_n$ on $\mathcal{B}$ such that

$$\lim_{|f_n(B)| \to \infty} H_n\{a\} = -\lim_{|f_n(B)| \to \infty} \sum_{a \in f_n(B)} \ln[P_n f_n^{-1}(a)]\, P_n f_n^{-1}(a) = \infty$$



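Proof. It is sufficient to take $P_n$ on $B$ that induce under the mappings $f_n : B \to A$ constant (i.e. uniform) probability distributions on the images $f_n(B)$. For example, assuming without loss of generality that $B$ is countable, define the following function on $B$:

$$P_n(b) := \frac{1}{|f_n(B)|} \cdot \frac{1}{|f_n^{-1} \circ f_n(b)|}$$

It is a probability measure, because it is positive, additive and $P_n(B) = 1$. Indeed,

$$P_n(B_j) = \sum_{b \in B_j} \frac{1}{|f_n(B)|\,|f_n^{-1} \circ f_n(b)|} \leq \sum_{a \in f_n(B_j)} \frac{|f_n^{-1}(a)|}{|f_n(B)|\,|f_n^{-1}(a)|} = \frac{|f_n(B_j)|}{|f_n(B)|}$$

where equality holds if and only if $B_j = f_n^{-1} \circ f_n(B_j)$. Then

$$P_n f_n^{-1}(a) = \sum_{b \in f_n^{-1}(a)} \frac{1}{|f_n(B)|} \cdot \frac{1}{|f_n^{-1} \circ f_n(b)|} = \frac{1}{|f_n(B)|} \cdot \frac{|f_n^{-1}(a)|}{|f_n^{-1}(a)|} = \frac{1}{|f_n(B)|}$$

The entropy of $P_n f_n^{-1}(a)$ is $H_n\{a\} = \ln|f_n(B)|$, and it grows infinitely with $|f_n(B)|$.

□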
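
The construction in the proof can be reproduced on a finite truncation of a countable set. The sketch below is illustrative only: the sequence `f_n`, the truncated domain and all helper names are assumptions chosen for the example, not objects defined in the text.

```python
import math
from collections import Counter
from fractions import Fraction

def maximizing_input(f, domain):
    """P(b) = 1 / (|f(B)| * |f^{-1}(f(b))|), the input measure used in the proof."""
    domain = list(domain)
    preimage_sizes = Counter(f(b) for b in domain)
    image_size = len(preimage_sizes)
    return {b: Fraction(1, image_size * preimage_sizes[f(b)]) for b in domain}

def pushforward(f, measure):
    """Output distribution P f^{-1}(a): total mass of the preimage of each a."""
    output = Counter()
    for b, mass in measure.items():
        output[f(b)] += mass
    return dict(output)

def entropy(measure):
    """Shannon entropy in nats."""
    return -sum(float(m) * math.log(float(m)) for m in measure.values() if m > 0)

def f_n(n):
    """Hypothetical sequence f_n(b) = min(|b|, n - 1), whose image has n points."""
    return lambda b: min(abs(b), n - 1)

B = range(-50, 51)                     # finite stand-in for a countable set
sizes = (2, 8, 32)
entropies = []
for n in sizes:
    P_n = maximizing_input(f_n(n), B)
    output = pushforward(f_n(n), P_n)  # uniform on the n-point image
    entropies.append(entropy(output))
```

Each induced output distribution is uniform on the image, so the computed entropies equal $\ln 2$, $\ln 8$ and $\ln 32$ and grow with the image size.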



It follows from Proposition 6 that if the amount of information communicated by a deterministic transition kernel $\delta_{f(b)}(A_i)$ is finite for any input distribution $P(B_j)$, then the image of $f$ must be finite. Note that this argument is not based on any specific notion of mutual information. For Shannon information, one can show that the following inequality holds for a deterministic kernel $\delta_{f(b)}(A_i)$:



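$$I_S\{a,b\} = \sum_{b \in B} \sum_{a \in A} \left[\ln \frac{\delta_{f(b)}(a)}{Pf^{-1}(a)}\right] \delta_{f(b)}(a)\, P(b) = \sum_{b \in B} \left[\ln \frac{1}{Pf^{-1} \circ f(b)}\right] P(b) \leq \ln|f(B)| \qquad (15)$$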



This inequality is obtained by maximizing $I_S\{a,b\}$ for a fixed deterministic kernel $\delta_{f(b)}(A_i)$ over all input distributions $P(b)$. The supremum of $I_S\{a,b\}$ is achieved at $P(b)$ inducing a constant distribution $Pf^{-1}(a)$ on $A$, such as the maximizing distribution in Proposition 6.
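
Inequality (15) can be checked numerically for a small example. In the sketch below the function `f`, the domain and both input distributions are hypothetical choices; the mutual information of a deterministic kernel is computed as $\sum_b P(b) \ln[1/Pf^{-1}(f(b))]$.

```python
import math
from collections import Counter

def mutual_information_deterministic(f, P):
    """I_S{a,b} = sum_b P(b) ln(1 / P f^{-1}(f(b))) for the kernel delta_{f(b)}(a)."""
    output = Counter()
    for b, mass in P.items():
        output[f(b)] += mass
    return sum(mass * math.log(1.0 / output[f(b)])
               for b, mass in P.items() if mass > 0)

def f(b):
    """Hypothetical non-injective function with a three-point image."""
    return b % 3

domain = range(12)
uniform = {b: 1.0 / 12 for b in domain}
skewed = {b: 2.0 ** -(b + 1) for b in domain}
skewed[11] += 2.0 ** -12       # leftover mass goes to the last point, so the total is 1

bound = math.log(3)            # ln |f(B)|
info_uniform = mutual_information_deterministic(f, uniform)
info_skewed = mutual_information_deterministic(f, skewed)
```

The uniform input induces a uniform output on the three-point image and attains the bound $\ln 3$, while the skewed input stays strictly below it.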

5.4 Exponential kernels 

If the function $f : B \to A$ is not injective, then there exist input distributions $P(B)$ with non-zero entropy such that $Pf^{-1}(a) = 1$ for some $a \in A$. In this case, the output entropy $H\{a\}$ is zero, and the transition kernel communicates no information. Moreover, if $f : B \to A$ has infinite domain and finite image, then its injectivity index is zero: $\lim_{|B| \to \infty} I(f) = 0$. This means that such a function can potentially 'lose' an infinite amount of information. Non-deterministic transition kernels, on the other hand, are quite different in this sense, because there exist kernels that always communicate some information. Important examples are exponential transition kernels.

Let $\Omega = A \times B$ and let $x : A \times B \to \mathbb{R}$ be a utility function. Consider the two variational problems introduced earlier (the second being problem (3)) with $I_{KL}(p,q) := \mathbb{E}_p\{\ln[p/q]\}$ defining Shannon mutual information (14). The unique solutions to these problems are joint probability measures $p_\beta \in \mathcal{P}(A \times B)$ that belong to a one-parameter exponential family:

$$dP_\beta(a,b) = e^{\beta[x(a,b) + \Phi(\beta^{-1})]}\, dP(a)\, dP(b),$$

where $\Phi(\beta^{-1})$ is determined from the normalization condition

$$e^{-\beta\Phi(\beta^{-1})} = \int_{A \times B} e^{\beta x(a,b)}\, dP(a)\, dP(b)$$

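The corresponding exponential transition kernels are

$$dP_\beta(a \mid b) = e^{\beta[x(a,b) + \Phi(\beta^{-1}, b)]}\, dP(a), \qquad dP_\beta(b \mid a) = e^{\beta[x(a,b) + \Phi(\beta^{-1}, a)]}\, dP(b),$$

where $\Phi(\beta^{-1}, b)$ and $\Phi(\beta^{-1}, a)$ now depend on $b$ and $a$, as they are computed using partial integrals:

$$e^{-\beta\Phi(\beta^{-1}, b)} = \int_A e^{\beta x(a,b)}\, dP(a), \qquad e^{-\beta\Phi(\beta^{-1}, a)} = \int_B e^{\beta x(a,b)}\, dP(b)$$

If the product $e^{\beta\Phi(\beta^{-1}, b)}\, dP(b)$ does not depend on $b$, and $e^{\beta\Phi(\beta^{-1}, a)}\, dP(a)$ does not depend on $a$, then the exponential kernels do not depend on the marginal measures $dP(a)$ and $dP(b)$ respectively. Indeed, because $dP(a) = \int_B dP(a,b)$ and $dP(b) = \int_A dP(a,b)$, we have the following equations:

$$\int_B e^{\beta[x(a,b) + \Phi(\beta^{-1}, b)]}\, dP(b) = 1, \qquad \int_A e^{\beta[x(a,b) + \Phi(\beta^{-1}, a)]}\, dP(a) = 1$$

Then, using the facts that $e^{\beta\Phi(\beta^{-1}, b)}\, dP(b)$ and $e^{\beta\Phi(\beta^{-1}, a)}\, dP(a)$ are constants, we obtain:

$$e^{-\beta\Phi(\beta^{-1}, b)} = [dP(b)/db] \int_B e^{\beta x(a,b)}\, db, \qquad e^{-\beta\Phi(\beta^{-1}, a)} = [dP(a)/da] \int_A e^{\beta x(a,b)}\, da$$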

Using these relations and the Bayes formula, the exponential transition kernels can be written in the following simple form:

$$dP_\beta(a \mid b) = \frac{e^{\beta x(a,b)}\, da}{\int_A e^{\beta x(a,b)}\, da}, \qquad dP_\beta(b \mid a) = \frac{e^{\beta x(a,b)}\, db}{\int_B e^{\beta x(a,b)}\, db}$$

Here, the normalizing integrals are constant, because they do not depend on $a$ or $b$, and one can introduce the free energy function $\Phi_0(\beta^{-1}) := -\beta^{-1} \ln \int_B e^{\beta x(a,b)}\, db$ or the free cumulant generating function $\Psi_0(\beta) := -\beta\Phi_0(\beta^{-1})$. If one of the marginal distributions, say $P(B)$, is fixed, then Shannon information has the following expression:

$$I_S\{a,b\} = \int_A dP(a) \int_B \left[\ln \frac{dP(b \mid a)}{dP(b)}\right] dP(b \mid a)$$

$$= \int_A dP(a) \int_B \left[\beta x(a,b) - \ln \int_B e^{\beta x(a,b)}\, db - \ln[dP(b)/db]\right] dP(b \mid a)$$

$$= \beta\, \mathbb{E}_{p_\beta}\{x\} - \Psi_0(\beta) + H\{b\} \qquad (16)$$

Here, $H\{b\} = -\int_B \ln[dP(b)/db]\, dP(b)$ is the differential entropy of $P(B)$ (assuming that the density $dP(b)/db$ exists). Observe also that the expected utility is the derivative of $\Psi_0(\beta) = \ln \int_B e^{\beta x(a,b)}\, db$:

$$\mathbb{E}_{p_\beta}\{x\} = \Psi_0'(\beta) \qquad (17)$$

Also, because $I_S\{a,b\} = H\{b\} - H\{b \mid a\}$, the difference $\Psi_0(\beta) - \beta\Psi_0'(\beta)$ is the conditional differential entropy $H\{b \mid a\}$. Expected utility defined by equation (17) is independent of the input distribution $P(B)$.

One can show that the products $e^{\beta\Phi(\beta^{-1}, b)}\, dP(b)$ and $e^{\beta\Phi(\beta^{-1}, a)}\, dP(a)$ are constant when $A = (A, +)$ and $B = (B, +)$ are equivalent locally compact groups with invariant measures $da$ and $db$, and the utility function is translation invariant: $x(a + c, b + c) = x(a,b)$. An important example is when $A$ and $B$ are equivalent linear spaces, and $x(a,b)$ depends only on the difference $a - b$ (e.g. $x(a,b) = -\frac{1}{2}\|a - b\|^2$). In such cases, the simplified expressions and equations (16) and (17) can be applied.
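
The role of translation invariance can be illustrated on a finite cyclic group, used here as a hypothetical stand-in for the locally compact groups above, with counting measure in place of $da$ and $db$. The sketch checks that the normalizing sum $\sum_a e^{\beta x(a,b)}$ is the same for every $b$ when $x$ depends only on $a - b$, and that it varies with $b$ otherwise. All names are assumptions made for the example.

```python
import math

def normalizers(x, group_order, beta):
    """Sum over a of exp(beta * x(a, b)), computed separately for every b."""
    return [sum(math.exp(beta * x(a, b)) for a in range(group_order))
            for b in range(group_order)]

def exponential_kernel(x, group_order, beta):
    """P(a | b) = exp(beta x(a, b)) / sum_a exp(beta x(a, b)); result indexed as [b][a]."""
    z = normalizers(x, group_order, beta)
    return [[math.exp(beta * x(a, b)) / z[b] for a in range(group_order)]
            for b in range(group_order)]

n = 7   # order of the cyclic group Z_n (illustrative choice)

def invariant_utility(a, b):
    """Depends only on (a - b) mod n, hence x(a + c, b + c) = x(a, b)."""
    d = (a - b) % n
    return -0.5 * min(d, n - d) ** 2

def non_invariant_utility(a, b):
    """Breaks translation invariance by weighting the error with b."""
    return (1 + b) * invariant_utility(a, b)

z_invariant = normalizers(invariant_utility, n, beta=1.5)
z_non_invariant = normalizers(non_invariant_utility, n, beta=1.5)
kernel = exponential_kernel(invariant_utility, n, beta=1.5)
```

With the invariant utility the kernel entry for a given pair depends only on the difference $a - b$, so no knowledge of the marginals is needed to write it down.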

Joint exponential measures $p_\beta$ are mutually absolutely continuous for all $\beta > 0$. Furthermore, by Corollary 1 about the support of utility functions $x(a,b)$ and due to normalization of probability measures, the condition $P_\beta(A_i \times B_j) = 0$ implies that $x(a,b)$ is constant on $A_i \times B_j$, and one may extend this to the case $x(a,b) = -\infty$. As is well known, exponential distributions approximate the Dirac $\delta$-function for $\beta \to \infty$. The corresponding joint probability measures define deterministic transition kernels $\delta_{f(b)}(a)$, where the function $f$ is such that $x(f(b), b) = \sup_{a \in A} x(a,b)$, and one may include the case $\sup x(a,b) = \infty$.
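
Both properties, common support for all finite $\beta$ and concentration on a function as $\beta$ grows, are easy to observe on a finite set. The sketch below is a toy illustration: the utility matrix, the uniform reference measures `P_a` and `P_b` and the helper names are assumptions, and the reference measures are treated simply as a fixed product measure.

```python
import math

def joint_exponential(x, P_a, P_b, beta):
    """p_beta(a, b) proportional to exp(beta * x[a][b]) * P_a[a] * P_b[b]."""
    weights = [[math.exp(beta * x[a][b]) * P_a[a] * P_b[b]
                for b in range(len(P_b))] for a in range(len(P_a))]
    total = sum(map(sum, weights))
    return [[w / total for w in row] for row in weights]

def kernel_a_given_b(joint):
    """Transition kernel P(a | b): each column of the joint measure, normalized."""
    n_a, n_b = len(joint), len(joint[0])
    col = [sum(joint[a][b] for a in range(n_a)) for b in range(n_b)]
    return [[joint[a][b] / col[b] for b in range(n_b)] for a in range(n_a)]

# Assumed utility x(a, b) = -(a - b)^2 / 2 on A = B = {0, 1, 2}, uniform reference measures.
x = [[-0.5 * (a - b) ** 2 for b in range(3)] for a in range(3)]
P_a = P_b = [1 / 3, 1 / 3, 1 / 3]

joints = {beta: joint_exponential(x, P_a, P_b, beta) for beta in (0.0, 1.0, 50.0)}
same_support = all(p > 0 for joint in joints.values() for row in joint for p in row)
sharp_kernel = kernel_a_given_b(joints[50.0])   # nearly the kernel of f(b) = b
```

Every entry stays strictly positive for each finite $\beta$, while at $\beta = 50$ the kernel is already numerically indistinguishable from the deterministic kernel of the maximizer $f(b) = b$.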

5.5 Qualitative example 

Strict inequalities of Theorem 2 present an interesting opportunity for constructing an example such that $\langle x, p_f\rangle = -\infty$ or $F(p_f) = \infty$ for any deterministic transition kernel satisfying a proper information constraint $F(p) \leq \lambda < \overline{\lambda}$ or a non-trivial expected utility constraint $\mathbb{E}_p\{x\} = \langle x, p\rangle \geq \upsilon > \upsilon_0$. If solutions $p_\beta$ to the corresponding variational problems exist, then the inequalities $\langle x, p_\beta\rangle > -\infty$ or $F(p_\beta) < \infty$ suggest that a non-deterministic transition kernel satisfying the same constraints may have finite expected utility and information. Such an example would provide a qualitative rather than quantitative illustration. Let us consider one prototypical example.

Let $a \in A$ and $b \in B$ be real variables, and let us consider the problem of information transmission between $A$ and $B$ that is optimal with respect to a measurable utility function $x : A \times B \to \mathbb{R}$. If $b \in (\mathbb{R}, \mathcal{R}, P)$ is a random variable with known distribution, then the expected utility $\mathbb{E}_p\{x\}$ is:

$$\mathbb{E}_p\{x\} = \int_A \int_B x(a,b)\, dP(a,b) = \int_B dP(b) \int_A x(a,b)\, dP(a \mid b) = \int_B \mathbb{E}_p\{x \mid b\}\, dP(b)$$

Here $\mathbb{E}_p\{x \mid b\}$ denotes the conditional expected utility, and it is maximized by choosing the optimal conditional probability measure $dP(a \mid b)$. The maximum of information is communicated by an injective function $a = f(b)$, defining a deterministic transition kernel. The optimal function is such that $x(f(b), b) = \sup_{a \in A} x(a,b)$. On the other hand, if no information can be communicated, then $dP(a \mid b) = dP(a)$. A deterministic kernel communicating no information is defined by a constant function. Note, however, that one can still choose an optimal constant function $\hat{a}_1 = f(b)$. Indeed, if $x(a,b)$ is differentiable and concave in $a$, then $\hat{a}_1$ is a solution to the equation $\nabla_a \int_B x(a,b)\, dP(b) = 0$. In particular, if $x(a,b) = -\frac{1}{2}(a - b)^2$, then $\nabla_a \int_B x(a,b)\, dP(b) = \int_B (b - a)\, dP(b)$, and $\hat{a}_1 = \int_B b\, dP(b) = \mathbb{E}_p\{b\}$, which is the well-known classical method minimizing mean-squared deviation. Thus, for any constant function $f(b) = a_1$,

$$\mathbb{E}_{p_f}\{x\} = -\frac{1}{2}\int_B (a_1 - b)^2\, dP(b) \leq -\frac{1}{2}\int_B (\mathbb{E}_p\{b\} - b)^2\, dP(b) = -\frac{1}{2}\operatorname{Var}\{b\}$$

The value on the right depends on the distribution $P(B)$, and there are many examples of distributions with unbounded variance, such as $dP(b) = [\pi(b^2 + 1)]^{-1}\, db$ (the Cauchy distribution). Indeed, the integral $\int_B (a - b)^2 (b^2 + 1)^{-1}\, db$ does not converge on $B = (-\infty, \infty)$.
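
The divergence can be made concrete by truncating the integral to $[-T, T]$ and letting $T$ grow. The sketch below (hypothetical helper names) evaluates the truncated integral $\int_{-T}^{T} (a - b)^2 [\pi(b^2 + 1)]^{-1}\, db$ with a midpoint rule and compares it with its closed form $\frac{2}{\pi}[T + (a^2 - 1)\arctan T]$, which grows linearly in $T$.

```python
import math

def truncated_expected_loss(a, T, steps=100_000):
    """Midpoint-rule value of the integral of (a - b)^2 / (pi (b^2 + 1)) over [-T, T]."""
    h = 2.0 * T / steps
    total = 0.0
    for k in range(steps):
        b = -T + (k + 0.5) * h
        total += (a - b) ** 2 / (math.pi * (b * b + 1.0))
    return total * h

def closed_form(a, T):
    """The same truncated integral in closed form: (2 / pi) (T + (a^2 - 1) arctan T)."""
    return (2.0 / math.pi) * (T + (a * a - 1.0) * math.atan(T))

a = 0.0                                  # any constant estimate shows the same growth
limits = (10.0, 100.0, 1000.0)
values = [truncated_expected_loss(a, T) for T in limits]
```

The expected utility of the constant estimate is $-\frac{1}{2}$ times this quantity, so it decreases without bound as the truncation is removed.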

Let us assume now that some limited information can be communicated so that $dP(a \mid b) \neq dP(a)$ (and hence $dP(b \mid a) \neq dP(b)$). For example, this can be the information associated with $b$ belonging to some subset of $B$, such as $b > 0$ or $b < 0$. In each case, one can choose different optimal elements $\hat{a}_1$ and $\hat{a}_2$. A more 'precise' information would correspond to a larger number $n$ of subsets $B_i \subset B$ and optimal elements $\hat{a}_i$, such that

$$\mathbb{E}_{p_f}\{x\} \leq -\frac{1}{2}\sum_{i=1}^{n} \int_{B_i} (\hat{a}_i - b)^2\, dP(b)$$

Observe that the value above still depends on $P(B)$, and because for any finite partition of the real line there are some unbounded intervals, one can take $P(B)$ giving a negatively infinite value on the right. For example, if $P(B)$ is the Cauchy distribution, then the integral $\int (a - b)^2 (b^2 + 1)^{-1}\, db$ does not converge on the intervals $B_1 = (-\infty, 0]$ or $B_2 = [0, \infty)$. Thus, $b$ can be distributed in such a way that the expected value of utility $x(a,b) = -\frac{1}{2}(a - b)^2$ cannot be larger than $-\infty$ for any deterministic $p_f$ with finite image $|f(B)|$. The expected utility can have finite values only if $f$ has an infinite image. By the argument of Proposition 6, however, this means that the function can communicate an infinite amount of information. Let us show now that there exist non-deterministic transition kernels for this problem achieving finite expected utility and communicating a finite amount of information.

Indeed, consider an exponential kernel from Section 5.4, optimal for constraints on Shannon mutual information. Because the utility function $x(a,b) = -\frac{1}{2}(a - b)^2$ is translation invariant, $x(a + c, b + c) = x(a,b)$, we can use the simplified expressions from Section 5.4. In particular, $\Psi_0(\beta) = \ln\sqrt{2\pi\beta^{-1}}$, and the exponential kernel is Gaussian:

$$dP_\beta(a \mid b) = \sqrt{\frac{\beta}{2\pi}}\, e^{-\frac{1}{2}\beta(a - b)^2}\, da$$

Conditional expectation $\mathbb{E}_{p_\beta}\{x \mid b\}$ is constant for all $b \in B$:

$$\mathbb{E}_{p_\beta}\{x \mid b\} = -\frac{1}{2}\sqrt{\frac{\beta}{2\pi}} \int_A (a - b)^2\, e^{-\frac{1}{2}\beta(a - b)^2}\, da = -\frac{1}{2}\beta^{-1}$$

and therefore

$$\mathbb{E}_{p_\beta}\{x\} = \int_B \mathbb{E}_{p_\beta}\{x \mid b\}\, dP(b) = -\frac{1}{2}\beta^{-1}$$

The expression above can also be easily obtained from equation (17) as the derivative of $\Psi_0(\beta) = \ln\sqrt{2\pi\beta^{-1}}$. The optimal value $\beta^{-1} > 0$ depends on the amount $\lambda$ of mutual information, and it can be computed using equation (16) by inverting $\lambda = I_S\{a,b\}$:

$$\beta = 2\pi e \cdot e^{-2[H\{b\} - \lambda]}$$

The value $\beta$ depends on the difference $H\{b\} - \lambda$, which equals the conditional differential entropy $H\{b \mid a\}$, because $I_S\{a,b\} = H\{b\} - H\{b \mid a\} = \lambda$. Therefore, if $H\{b \mid a\}$ is finite, then $\beta > 0$, and $\mathbb{E}_{p_\beta}\{x\}$ is finite for all $\lambda > 0$.
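
Taking the formulas of this section at face value, the optimal $\beta$ and the resulting expected utility can be evaluated for a Cauchy-distributed input, whose differential entropy is $\ln 4\pi$ nats for the standard (unit-scale) case. The sketch below uses hypothetical function names and an arbitrary information constraint of one nat; it also inverts the computation through equation (16) as a consistency check.

```python
import math

def beta_from_information(entropy_b, information):
    """Invert lambda = I_S{a,b}: beta = 2 pi e exp(-2 [H{b} - lambda])."""
    return 2.0 * math.pi * math.e * math.exp(-2.0 * (entropy_b - information))

def information_from_beta(entropy_b, beta):
    """Equation (16) with E{x} = -1 / (2 beta) and Psi_0(beta) = ln sqrt(2 pi / beta)."""
    expected_utility = -0.5 / beta
    psi_0 = 0.5 * math.log(2.0 * math.pi / beta)
    return beta * expected_utility - psi_0 + entropy_b

entropy_cauchy = math.log(4.0 * math.pi)   # differential entropy of the standard Cauchy law
information = 1.0                          # assumed information constraint, in nats

beta = beta_from_information(entropy_cauchy, information)
expected_utility = -0.5 / beta             # finite, unlike any finite-image deterministic kernel
round_trip = information_from_beta(entropy_cauchy, beta)
```

For these inputs $\beta = e^3/(8\pi)$, so the expected utility is finite even though the input distribution has unbounded variance.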

Other examples can be constructed using the same principles. For instance, if $A = B = \mathbb{N}$, and the utility function $x(a,b)$ is a polynomial of degree $m > 1$, then one can distribute $b \in B$ according to $P(b) = [b^{m+1}\zeta(m+1)]^{-1}$, where $\zeta(k) = \sum_{b \in \mathbb{N}} b^{-k}$ is the Riemann zeta function. In this case, the expected utility is negatively infinite for any deterministic kernel $\delta_{f(b)}(a)$, if $f$ has finite image satisfying a finite information constraint. The optimal transition kernels satisfying both finite expected utility and finite information constraints in such problems are non-deterministic. These examples demonstrate that deterministic and non-deterministic transition kernels are qualitatively different, because their expected utilities can be separated by infinity.

5.6 Application: Deterministic and non-deterministic algorithms 

Because Markov transition kernels give a non-deterministic generalization of functions, they can be used to model various input-output or information processing systems. Computational machines and algorithms are examples of such systems, and we now discuss how they can be represented by transition kernels and the corresponding variational problems. Results of this work may have interesting applications to the study of algorithms and computation.

An algorithm $\Gamma$ is defined as a system of computations transforming input words $w_0$ in some finite alphabet into output (e.g. final) words $w_t$ (e.g. [24]). Each word in the domain of definition of $\Gamma$ can be considered as an initial word $w_0$. In a deterministic algorithm, the computation process is performed by a sequence of transformations $\gamma(w_t) = w_{t+1}$ of words, where $\gamma$ is called the direct processing operator [21] or a transition function. In a non-deterministic algorithm, these transitions are randomized according to some local probabilities. The computational process may terminate reaching a final word (answer), terminate without reaching a final word (error) or continue the computations indefinitely. In addition, when computation terminates with a non-final word, one may distinguish between errors of the first and second kinds (i.e. false positives and false negatives). Algorithms may be restricted to run in time polynomial in the size of input words or to produce only certain types of errors (i.e. one-sided errors).

The computational cost of $\Gamma(w_0)$ can be associated with resources or complexity of computations, such as the length of the output sequence $(w_1, \ldots, w_t)$, if $w_t$ is final:

$$l(\Gamma(w_0), w_0) := \begin{cases} t & \text{if } \Gamma(w_0) = (w_1, \ldots, w_t) \text{ and } w_t \text{ is a final word} \\ \infty & \text{otherwise} \end{cases}$$

A Boolean loss function can be defined by $\delta_\infty(l(\Gamma(w_0), w_0))$, where $\delta_\infty(\cdot)$ indicates an error (i.e. one, if the algorithm does not terminate or terminates with a non-final word). A utility of computation can be defined by any function proportional to negative loss, such as Boolean utility $x(\Gamma(w_0), w_0) = 1 - \delta_\infty(l(\Gamma(w_0), w_0))$. Maximization of expectation $\mathbb{E}_p\{x\}$ for Boolean utility is maximization of the probability that computation terminates with a final word.

Both deterministic and non-deterministic algorithms compute a function from the set of input words $w_0$, for which the computation terminates with an answer, onto the set of final words $w_t$. The main difference is that a non-deterministic algorithm can compute the pair $(w_0, w_t)$ in different ways and with different running times, so that the cost or utility of a non-deterministic computation is a random variable. We can represent algorithms by Markov transition kernels as follows.

Let $B$ be the set of all input words $w_0$, and let $A$ be the set of all, possibly infinite, output word sequences $\{w_t\}$. A deterministic algorithm corresponds to a deterministic Markov transition kernel $\delta_{\Gamma(b)}(a)$, so that each input word is mapped to a particular output word sequence: $B \ni w_0 \mapsto \Gamma(w_0) = (w_1, \ldots, w_t, \ldots) \in A$. A non-deterministic algorithm assigns non-zero probabilities $P_\Gamma(a \mid b)$ to different output sequences. We say that two algorithms are equivalent if they correspond to identical Markov transition kernels. Points in the set $\mathcal{P}(A \times B)$, which is a Choquet simplex, correspond to equivalence classes of all deterministic and non-deterministic algorithms defined on $B$, together with all distributions $P(B)$ of input words. This formalism allows us to consider optimization of algorithms in the context of the variational problems studied in this work and their generalizations.
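
The cost function, the Boolean utility and the kernel representation described above can be sketched on a toy example. Everything concrete below is hypothetical: runs are tuples of words, the empty word is taken as the only final word, and a kernel is stored as a dictionary mapping each input word to a distribution over runs.

```python
import math

def cost(run, is_final):
    """l: length t of the output sequence if its last word is final, infinity otherwise."""
    if run and is_final(run[-1]):
        return len(run)
    return math.inf

def boolean_utility(run, is_final):
    """x = 1 - delta_inf(l): one if the computation ended with a final word, else zero."""
    return 0 if math.isinf(cost(run, is_final)) else 1

def expected_utility(kernel, input_distribution, is_final):
    """E_p{x} = sum_b P(b) sum_a P(a | b) x(a, b) for a kernel {b: {run: probability}}."""
    return sum(p_b * sum(p_run * boolean_utility(run, is_final)
                         for run, p_run in kernel[b].items())
               for b, p_b in input_distribution.items())

def is_final(word):
    """Assumed convention: the empty word marks a successful halt."""
    return word == ""

# Toy runs that delete one letter per step; "ab#" is a stuck, non-final word.
deterministic = {"ab": {("b", ""): 1.0}, "abc": {("ab#",): 1.0}}
randomized = {"ab": {("b", ""): 1.0},
              "abc": {("bc", "c", ""): 0.75, ("ab#",): 0.25}}
inputs = {"ab": 0.5, "abc": 0.5}

utility_deterministic = expected_utility(deterministic, inputs, is_final)
utility_randomized = expected_utility(randomized, inputs, is_final)
```

In this representation a deterministic algorithm is a kernel whose rows are point masses, and the expected Boolean utility is simply the probability of halting with a final word.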

Indeed, optimization of a class of algorithms subject to a constraint $\mathbb{E}_p\{l\} \leq \upsilon$ on the expected loss or a constraint $\mathbb{E}_p\{x\} \geq \upsilon$ on the expected utility has been considered in complexity theory (e.g. see [16]). For example, the complexity class of bounded-error probabilistic polynomial time machines (BPP) is defined as a class of problems solved by non-deterministic algorithms with constraints on the expected error (i.e. $\mathbb{E}_p\{x\} \geq \upsilon > 1/2$, where $x$ is Boolean utility). Information constraints have also been considered in complexity theory, such as constraints on communication capacity (communication complexity) or in the class of probabilistically checkable proofs (PCP), which is defined as a non-deterministic algorithm with constraints on randomness and a number of queries to an oracle (i.e. a constraint on the amount of information about the proof). Problems of optimization of algorithms can be considered as a search for the corresponding class of optimal Markov transition kernels (i.e. variational problems of the third kind in information theory). The optimal value functions put the expected utility constraint $\mathbb{E}_p\{x\} \geq \upsilon$ in duality with a constraint $F(p) \leq \lambda$ on an information resource. Thus, the study of performance and computational complexity of the algorithms is related to the study of their information constraints.

6 Discussion 

We have studied families of optimal measures using a generalization of the classical varia- 
tional problems of information theory fST, "S?^ and statistical physics (Tf]. In fact, standard 
formulae of these theories relating Gibbs measures, free energy, entropy and channel capacity 
can be recovered simply by defining information constraints using the Kullback-Leibler di- 
vergence. The main motivation for the generalization was understanding the mutual absolute 
continuity of measures within optimal families, and it was established that such families ex- 
ist if an abstract information resource has a strictly convex dual, which is a geometric rather 
than algebraic property of information. We have discussed also that strict convexity of the 
dual functional is related to separability of different variational problems, which is useful in 
the context of optimization. Our method does not depend on commutativity of the algebra 
of random variables or observables, and for this reason the result holds both for commutative 
(classical) and non-commutative (quantum) measures. 

Mutual absolute continuity of optimal probability measures allowed us to show that deterministic transition kernels are strictly sub-optimal. This result is important not only for applications of optimization theory, but also for some theoretical questions in studies of algorithms and computational complexity, where much of the effort is devoted to the question whether non-deterministic procedures have any qualitative advantage over deterministic ones. Our results suggest that in a broad class of optimization problems with constraints on information, optimal deterministic kernels do not exist. Moreover, an example has been constructed to show that the difference between expected utilities of deterministic and non-deterministic kernels can be infinite for all proper constraints on an information resource.

These results about strict sub-optimality of deterministic kernels do not contradict the established understanding in the classical theory of statistical decisions that asymptotically randomized policies cannot be better than deterministic ones (e.g. see [35] or more recently [21]). Indeed, these asymptotic results are concerned with obtaining all, possibly an infinite amount of, information, in which case there are deterministic optimal kernels. Our results, on the other hand, are about optimality subject to constraints making such asymptotic solutions infeasible. Note also that a simple randomization of a function's output can only decrease (lose) the amount of information it communicates. However, we have compared deterministic and non-deterministic kernels that can communicate the same amount of information. The possibility to separate deterministic and non-deterministic transitions qualitatively (i.e. by infinity) is particularly interesting, because it confirms a common intuition in applied optimization about numerous problems, in which non-deterministic algorithms outperform all known deterministic methods.